Preface



Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) are the most 
frequently used tools for programming according to the message passing paradigm, 
which is considered one of the best ways to develop parallel applications. 

This volume comprises 42 revised contributions presented at the Seventh 
European PVM/MPI Users' Group Meeting, which was held in Balatonfüred,
Hungary, 10-13 September 2000. The conference was organized by the Laboratory of 
Parallel and Distributed Systems of the Computer and Automation Research Institute 
of the Hungarian Academy of Sciences. 

This conference was previously held in Barcelona, Spain (1999), Liverpool, 
UK (1998) and Cracow, Poland (1997). The first three conferences were devoted to 
PVM and were held at the Technische Universität München, Germany (1996), École
Normale Supérieure Lyon, France (1995), and University of Rome, Italy (1994).

This conference has become a forum for users and developers of PVM, MPI, 
and other message passing environments. Interaction between those groups has 
proved to be very useful for developing new ideas in parallel computing and for 
applying existing ideas to new practical fields. The main topics of the meeting were 
evaluation and performance of PVM and MPI, extensions and improvements to PVM 
and MPI, algorithms using the message passing paradigm, and applications in science 
and engineering based on message passing. The conference included four tutorials 
and five invited talks on advances in MPI, cluster computing, network computing, 
grid computing, and SGI parallel computers and programming systems. These 
proceedings contain papers on the 35 oral presentations together with 7 poster 
presentations. 

The seventh Euro PVM/MPI conference was held together with DAPSYS 
2000, the third Austrian-Hungarian Workshop on Distributed and Parallel Systems. 
Participants of the two events shared invited talks, tutorials, a vendor session and 
social events while contributed paper presentations proceeded in separate tracks in 
parallel. While Euro PVM/MPI was dedicated to the latest developments of PVM and 
MPI, DAPSYS was a major event to discuss general aspects of distributed and 
parallel systems. In this way the two events complemented each other and participants 
of Euro PVM/MPI could benefit from the joint organization of the two events. 

Invited speakers of Euro PVM/MPI were Al Geist, Miron Livny, Ewing Lusk,
Thomas Sterling, and Bernard Tourancheau. 

We would like to express our gratitude for the kind support of Silicon 
Computers, Microsoft, Myricom, and the Foundation for the Technological Progress 
of the Industry. Also, we would like to say thanks to the members of the Program 
Committee for their work in refereeing the submitted papers and ensuring the high 
quality of Euro PVM/MPI. 
PVM and MPI: What Else Is Needed for
Cluster Computing? 



Al Geist

Oak Ridge National Laboratory, 
PO Box 2008, 

Oak Ridge, TN 37831-6367 
gst@ornl.gov

http://www.csm.ornl.gov/~geist



Abstract. As we start the new millennium, let us first look back over 
the previous ten years of PVM use (and five years of MPI use) and 
explore how parallel computing has evolved from Crays and networks of
workstations to Commodity-off-the-shelf (COTS) clusters. During this 
evolution, schedulers, monitors, and resource managers were added on 
top of PVM and MPI. This talk looks forward and predicts what software 
besides PVM and MPI will be needed to effectively exploit the cluster 
computing of the next ten years. 



1 The First 10 Years 

When the first PVM application was developed back in 1990 to study high- 
temperature superconductivity, the most common PVM platform was a network 
of workstations (NOW). This application was the first of several applications 
to win Gordon Bell prizes using PVM during the last decade. Heterogeneous 
NOWs were just starting to be exploited in the early 90s. Parallel computer 
companies came and went with lifetimes of a few years. Computer architectures
varied widely from company to company making it difficult to develop scientific 
applications that did not have to be rewritten for each new architecture. During 
this period PVM provided a stable middleware layer on which to build parallel 
applications. PVM even today takes care of the details of a given architecture or 
NOW while presenting a simple set of functions to the application. In the mid-
to-late 90s, MPI was created to provide a common message-passing interface for
parallel computers, thereby improving the portability of scientific applications.

Neither PVM nor MPI provided for all the needs of the users and system 
administrators. Soon researchers were developing schedulers, performance mon- 
itors, resource managers, debuggers, and performance enhancements for these 
environments. Once again it became hard for application developers to know 
what was the best combination of software packages to use for their science. 
Meanwhile the variability in computer architectures died away and the remain- 
ing vendors converged on clusters of commodity computers. 
2 Cluster Computing Today

Today the fastest growing market for parallel computers is the cluster of PCs. 
The key turning point was when PCs became as powerful as workstations. Pop- 
ularized by the Beowulf project at NASA in the late 1990s, clusters of PCs are 
much less expensive than their cousins the IBM SP and Compaq Sierra. Once 
again PVM and MPI are there to provide a standard programming interface, 
but large PC clusters lack all the software tools needed for system administra- 
tion, I/O, parallel file systems, etc. Huge gaps exist between the usability of a 
vendor’s cluster and a homegrown cluster of PCs. 

At ORNL we have large SP and Sierra systems as well as a 128 processor 
Pentium III Linux cluster. This talk will describe and demonstrate the latest 
cluster management software being developed at ORNL. The software is called 
M3C (Managing and monitoring multiple clusters) and C3 (Cluster Command 
and Control). In addition, the talk will describe a new effort by Intel, IBM, SGI,
ORNL, NCSA, and several others to create a Community Cluster Development 
Kit to be distributed as open source. 

3 Next 10 Years 

This talk will project forward into the first decade of the 21st century and discuss 
the future needs and potential directions for cluster computing applications. We 
will cover three topics: extensible distributed computing environments, fault tol- 
erance with adaptive recovery, and desirable collaboration features in distributed 
virtual machines. 

Collaboration is growing in importance in distributed environments as scien- 
tific problems get more complex and experts from around the world are involved 
in experiments. Projects like CUMULVS at ORNL, which allows remote scientists
to dynamically attach, visualize, and steer long running simulations, point out 
the shortcomings of PVM and MPI. While the APIs in PVM and MPI may still 
be the standards in the coming years, the distributed environment that supplies 
these APIs will need to provide the ability for groups of resources (both people 
and hardware) to be joined together for some period of time and then split back 
apart afterwards. 

The Harness project is a collaboration between the original PVM developers: 
ORNL, UTK, and Emory University. Building on our experience with PVM, the 
project goal is to create a fundamentally new heterogeneous virtual machine 
based on three research concepts: a parallel plug-in environment - extending the 
concept of a plug-in into the parallel computing world, distributed peer-to-peer 
control - eliminating single (and multiple) points of failure, and merging/splitting
of multiple virtual machines to support advanced collaboration between research 
teams. An initial prototype of Harness was released this past Spring and plug-ins 
to provide PVM and a fault tolerant MPI are nearing completion. An update 
on the status of Harness will conclude the talk. 

Links to information about all the projects mentioned in this extended abstract
can be found on the author’s home page: www.csm.ornl.gov/~geist.




Managing Your Workforce on a Computational Grid 



Miron Livny 



Computer Sciences Department 
University of Wisconsin-Madison 
1210 West Dayton St. 
Madison, WI 53706 
miron@cs.wisc.edu



Abstract. The Master-Worker distributed computing paradigm has proven to be 
a very effective means for harnessing the power of computational grids. At any 
given time, the master of the application controls a collection of CPUs that has 
been allocated to the application by the resource manager of the grid. Effective 
management of this dynamic "workforce" of CPUs holds the key to the ability of
the application to meet its computational objectives. As in similar real-life
situations, the master has to decide on a target size and composition for the 
workforce, a recruiting strategy and a dismissal policy. It has to decide on who 
does what and how to deal with workers that do not complete their assigned 
task on time. 



Introduction 

Running a Master-Worker application on a Computational Grid resembles managing 
a real-life factory with human workers and real machines. The master of the 
application faces almost the same challenges a human factory manager does. All the 
resources of the grid are potential candidates to join the "workforce" of the 
application. The availability and properties of these resources are very dynamic and 
unpredictable. It is up to the master to recruit these resources and integrate them into 
its workforce. Once a resource joins the workforce, the master has to decide on the 
work to be allocated to the worker, how to monitor the worker's progress and what to 
do if the worker quits or does not complete the assigned task on time. Grid resources 
can vary in speed, reliability and cost. Some of them may be available for long time
periods while others may only join a workforce for short intervals. 
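
The decisions listed above can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not the framework or tools described in this abstract: the function name `run_master`, the `deadline` parameter, and the table of worker speeds are all hypothetical. A master hands tasks to the workers in its workforce, dismisses any worker that quits or misses the deadline, and puts the unfinished task back in the queue.

```python
from collections import deque

def run_master(tasks, workers, deadline):
    """Toy master loop (hypothetical). `workers` maps a worker name to the
    time it needs per task, or to None if the worker quits without finishing."""
    pending = deque(tasks)
    done = {}            # task -> name of the worker that completed it
    reassigned = 0       # number of tasks that had to be handed out again
    active = dict(workers)
    while pending:
        if not active:
            raise RuntimeError("no workers left in the workforce")
        # One scheduling round: each active worker receives one task.
        for name in list(active):
            if not pending:
                break
            task = pending.popleft()
            cost = active[name]
            if cost is None or cost > deadline:
                # Worker quit or was too slow: dismiss it and requeue the task.
                del active[name]
                pending.append(task)
                reassigned += 1
            else:
                done[task] = name
    return done, reassigned

workers = {"fast": 1.0, "slow": 5.0, "flaky": None}
done, reassigned = run_master(range(6), workers, deadline=2.0)
```

A real master would also recruit new resources while the computation runs; this sketch only models task assignment and dismissal.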

We are currently engaged in a number of efforts that address different aspects of this 
management problem. These efforts include the development of a framework for
managing the workforce of grid resources, implementation of tools
and a runtime support library for Master-Worker applications, and experimentation
with large-scale applications. Recently, we completed a large Master-Worker 
computation that consumed almost 12 CPU years of grid resources in less than a 
week. The grid had more than 2300 CPUs and was actively used by other 
applications. At one point the workforce of the computation reached 1009 workers. 

J. Dongarra et al. (Eds.): EuroPVM/MPI 2000, LNCS 1908, pp. 3-4, 2000.
On the average the size of the workforce was 650. This run has demonstrated the
strength and the weaknesses of our approach and tools. We are currently working on 
enhancing our framework and tools to address these weaknesses and to broaden the 
size and class of applications that can be served by these tools. 




Isolating and Interfacing the Components of a 
Parallel Computing Environment* 



Ewing Lusk 

Argonne National Laboratory 



Abstract. A message-passing library interface like MPI or PVM is only
one interface between only two components of the complete environment
seen by the user of a parallel system. Here we discuss a larger set of
components and draw attention to the usefulness of considering them 
separately. Such an approach causes us to focus on the interfaces among 
such components. Our primary motivation is the efficient use of large 
clusters to run MPI programs, and we describe current efforts by our 
group at Argonne to address some of the interface issues that arise in 
this context. 



1 Introduction 

MPI is an example of an interface, specifically, an interface between an applica- 
tion (or library) and the portable component of a communication library. MPI 
has been successful as an interface specification partly because it is only an
interface; a specific implementation of MPI is a different object. The communica- 
tion library is only one component (or several) of the overall parallel computing 
environment seen by a user or parallel program. The desire to make the en- 
tire environment more usable, flexible, and powerful motivates us to consider 
the components of the environment separately and look at the interfaces among 
these components. In this talk we will focus on the impact of some of these in- 
terfaces on the task of implementing MPI, particularly MPI-2. We will conclude 
with a preliminary proposal for an interface between a communication library 
and a process manager, and survey some of the tasks that remain to be done. 

2 Components and Interfaces 

Any list of the components of a parallel environment is bound to be incomplete, 
but for the sake of our discussion, we need to at least identify the following: a 
process manager that starts, monitors, and cleans up after processes; a parallel 
library that implements communication among these processes, and a job sched- 
uler (perhaps nonexistent) that decides when and where to run parallel jobs. 

* This work was supported by the Mathematical, Information, and Computational Sci- 
ences Division subprogram of the Office of Advanced Scientific Computing Research, 
U.S. Department of Energy, under Contract W-31-109-Eng-38. 
The user should perhaps also be considered a component, so that user interfaces
(to the scheduler and process manager, for example) can be part of our discus- 
sion. These components have subcomponents, resulting in internal interfaces as 
well, such as MPICH’s Abstract Device Interface. In some systems these com- 
ponents are combined, such as in the case of a scheduler/process manager or a 
parallel library/process manager. When these components are integrated rather 
than isolated, it becomes difficult to select parts of them for replacement or to 
interface them to other components. 

In this talk we will use the MPICH implementation of MPI as a motivating 
example, showing how a precise interface to a process manager is necessary 
in order that a library like MPICH be able to use multiple process managers 
and that a given process manager support multiple parallel libraries. This is 
particularly true in the case of MPI-2 implementations, where the job scheduler 
and resource manager may need to be involved as well. 



3 A Library/Process Manager Interface 

To allow our new MPI implementation to use a process manager’s facilities 
but be independent of any specific one, even our own, we have designed an 
interface we call BNR. Requirements for BNR are that it allow a variety of 
implementations by process managers with quite different architectures, that it 
be simple enough to encourage implementation by other process managers, and 
that it be powerful enough to support MPI-2 implementation by the library. We 
have partially implemented BNR in our MPD process manager, and used the 
interface in our newly designed MPICH implementation of MPI. The focus on the 
interface will allow MPICH to be independent of any specific process manager 
yet take advantage of process manager capabilities where they are available. 
We will describe the BNR interface and how it is used to support some of the 
functionality of MPI-2. 

4 Future Work 

Several other interfaces need to be studied in order to complete the development 
of a usable, flexible, and powerful parallel environment. One of the most nec- 
essary yet most difficult is the interface between the job scheduler and process 
manager. These are two components that are frequently combined, precisely be- 
cause of this difficulty, yet scheduling and process management are quite different 
activities. All parts of the environment will benefit if interfaces are developed 
that allow components to evolve separately. 




Symbolic Computing with Beowulf-Class PC Clusters 



Dr. Thomas Sterling 

Center for Advanced Computing Research 
California Institute of Technology 
and 

High Performance Computing Group 
NASA Jet Propulsion Laboratory 



Abstract 

Beowulf-class systems are an extremely inexpensive way of aggregating substantial 
quantities of a given resource to facilitate the execution of different kinds of 
potentially large workloads. Beowulf-class systems are clusters of mass-market 
COTS PC computers (e.g. Intel Pentium III) and network hardware (e.g. Fast 
Ethernet, Myrinet) employing available Unix-like open source systems software (e.g. 
Linux) to deliver superior price-performance and scalability for a wide range of 
applications. Initially, Beowulfs were assembled to support compute intensive 
applications at low cost by integrating a large number of microprocessors primarily 
for science and engineering problems. But over the last few years, this class of 
clusters has expanded in scale, application domain, and means of use to embrace a 
much broader range of user problems. Beowulfs have become equally important as a
means of integrating large numbers of disk drives to realize large mass storage 
support systems for both scientific and commercial applications including data bases 
and transaction processing and are becoming a major workhorse for web servers and 
search engines. Yet, Beowulf-class systems are able to assemble together large 
ensembles of yet another type of resource: memory. This possibility may enable 
domains of computation so far largely unaddressed by the distributed cluster 
community. 

One such problem domain is symbolic computing which allows the representation 
and manipulation of abstract relationships among abstract objects. Once the principal 
tool of artificial intelligence (AI), symbolic computing has received less work than 
other domains as AI has garnered less attention. In the decades of the 1970s and
1980s, symbolic computation was the focus of significant effort with the development 
of such languages as Prolog, Scheme, OPS-5, and Common Lisp, as well as special 
purpose computers such as the Symbolics 3600 series and the TI Explorer. In 
addition, parallel symbolic computation was explored with such multiprocessor based 
systems as Concert and Multilisp. However, with the failure of AI to deliver results 
commensurate with the hype that surrounded it and the advent of more conventional 
RISC based systems that outperformed the slower special purpose microprogrammed
controlled systems, the focus on symbolic processing diminished leaving only small 
pockets of research in natural language processing and robotics among a few such 
areas. One of the factors that greatly hampered success in this regime was the 
inadequacy of the memory systems. Symbolic computation is memory intensive, 
easily consuming an order of magnitude or more of memory compared to
conventional applications. Beowulf-class systems offer an alternative path to 
achieving large memory systems at moderate cost. A Beowulf today can provide on 
the order of a thousand times the memory capacity available to symbolic computing 
problems of the late 1980s with total memory system prices (including PC nodes and 
networks) of significantly less than $10 per MByte. 

The Beowulf Common Lisp (BCL) project is exploring the application of Beowulf 
class systems to scalable symbolic computation. The motivation is driven both by the 
opportunity that Beowulf-class systems provide through their large aggregate memory 
capacities and by the potential application of symbolic computing to knowledge 
extraction and manipulation in the realm of large scientific applications. BCL is 
experimenting with a merger of distributed memory hardware and a programming 
model based on a global name space. The semantics of Common Lisp incorporate a 
number of intrinsic constructs which are inherently parallel, many of which lend 
themselves to distributed computing. One class of such constructs is the set of 
functional or value oriented operators that employ copies of argument structures. 
Functional semantics is well known for ease of parallelization and work cast in this 
form can be readily distributed across cluster nodes. Common Lisp incorporates a 
second set of constructs referred to as mapping functions that are intrinsically parallel 
permitting data parallel computation across corresponding elements of complex data 
structures. In addition, many Lisp operators permit out of order evaluation of 
arguments yielding yet more natural parallelism. Some of these instructions have 
corresponding operators that impose ordered evaluation (e.g. LET, LET*) providing 
synchronization where necessary. The CLOS object oriented system is one of the 
most powerful in language design and provides a natural program representation for 
parallel distribution, as well as synchronization, and encapsulation. Finally, while not 
part of the formal Common Lisp language, the "futures" construct developed initially 
by Hewitt and later by Halstead provides an important semantic tool for distributed 
coordination of symbolic functions at different levels of abstraction. 
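
As a rough illustration of the "futures" idea and of data-parallel mapping functions, the sketch below uses Python rather than Lisp, which is an assumption made here for brevity; it does not show BCL itself, and the name `parallel_mapcar` is hypothetical. A future is a placeholder for a value that is computed elsewhere, and the caller blocks only at the point where it needs that value.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def parallel_mapcar(fn, items, max_workers=4):
    """Apply fn to each item, loosely analogous to a Lisp mapping function.
    Each call is submitted as a future; results are collected in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, item) for item in items]
        # result() blocks only at the point where the value is needed.
        return [f.result() for f in futures]

squares = parallel_mapcar(square, [1, 2, 3, 4])
```

Because `square` is a pure function that works on its own argument, the calls can run in any order, which is the property that makes functional operators easy to distribute.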

There are many challenges to realizing an effective Beowulf Common Lisp. These 
include a distributed name space directory, movement of structures and processes 
across system nodes, dynamic memory management with automatic garbage 
collection, mapping between the Lisp semantics and the MPI to hide the explicit 
message passing mechanisms from the programmer, and a distributed form of CLOS 
for BCL. This talk will describe these challenges and the path being pursued along 
with preliminary results showing both the feasibility and early functionality. It is 
believed that the availability of a scalable distributed Common Lisp for Beowulf class 
systems will provide an impetus to the application of symbolic computing to scientific 
computing and automatic knowledge abstraction extraction. 
In this talk, we present the evolution of the high speed networks for clusters in
the context of off-the-shelf PC architecture.

We then present our experience in designing an MPI communication system 
based on the MPICH library and targeted for the Myrinet network using our 
own firmware called BIP. 

The MPI-BIP software layer protocol implementation tries to squeeze the most
out of the high speed Myrinet network, without wasting time in system calls or 
memory copies, giving all the speed to the applications. We present the protocols 
we used for the overall design and its implementation and optimization. The 
performances obtained are then presented. 

SMP architecture offers a very good price/performance ratio. We thus designed
a special device for SMP communications, called SMP-plug, that can follow the
network performances in both latency and bandwidth. We present its internal
design and performances for MPI. 

These two open-source software designs lead to parallel multicomputer-like
throughput and latency on very cheap clusters of PC workstations. 



(c) Springer-Verlag Berlin Heidelberg 2000 



A Benchmark for MPI Derived Datatypes 



Ralf Reussner¹, Jesper Larsson Träff², and Gunnar Hunzelmann¹

¹ LIIN, Universität Karlsruhe
Am Fasanengarten 5, D-76128 Karlsruhe, Germany.

² C&C Research Laboratories, NEC Europe Ltd.,
Rathausallee 10, D-53757 Sankt Augustin, Germany.
skampi@ira.uka.de



Abstract. We present an extension of the SKaMPI benchmark for MPI 
implementations to cover the derived datatype mechanism of MPI. All 
MPI constructors for derived datatypes are covered by the benchmark, 
and varied along different dimensions. This is controlled by a set of pre- 
defined patterns which can be instantiated by parameters given by the 
user in a configurations file. We classify the patterns into fixed types, 
dynamic types, nested types, and special types. We show results from 
the SKaMPI ping-pong measurement with the fixed and special types 
on three platforms: Cray T3E/900, IBM RS 6000SP, NEC SX-5. The ma- 
chines show quite some difference in handling datatypes, with typically 
a significant penalty for nested types for the Cray (up to a factor of 16) 
and the IBM (up to a factor of 8), whereas the NEC treats these types 
very uniformly (overhead of between 2 and 4). Such results illustrate the 
need for a systematic datatype benchmark to help the MPI programmer 
select the most efficient data representation for a particular machine. 



1 Introduction 

Derived datatypes in MPI provide a flexible mechanism for working with ar- 
bitrary non-contiguous layouts of data in memory. Derived datatypes are fully 
integrated into MPI, and can be used everywhere a predefined datatype is al- 
lowed, in particular as arguments in communication calls. Derived datatypes are 
useful in themselves, but additionally play an important role in the parallel I/O 
model of MPI-2 [2]. It is therefore important that use of derived datatypes
does not impair performance. Ideally, it should not be significantly more expen- 
sive to work with non-contiguous memory described by derived datatypes than 
it would be to manage such data layouts by hand. On the contrary, derived 
datatypes provide a handle for an efficient MPI implementation to avoid inter- 
mediate packing and unpacking of communication buffers that might otherwise 
be necessary when working with non-contiguous data manually. 

So far, there are no systematic benchmarks for evaluating the performance 
of derived datatypes with a given MPI implementation (on a given machine). In 
this paper we present an extension of the SKaMPI [Z| benchmark for MPI imple- 
mentations to cover also the derived datatype mechanism of MPI. Benchmarking 
has always played a particular role in high performance computing. With the 
advent of standards like MPI for portable parallel programming, benchmarking
of different MPI implementations on different target platforms has become im- 
portant for ensuring portability of applications also with respect to performance. 
The SKaMPI benchmark is intended as an accurate, detailed MPI benchmark 
which can be used to guide the design of efficient, portable applications. It can 
measure (nearly) all communication routines of the MPI standard, as well as all 
collective operations. Measurements can be selected from a large set of predefined 
communication patterns, either abstracting common MPI usage, or designed to 
measure key features of both hardware and implementation (bandwidth, scala- 
bility, etc.). Measurement details are controlled by the user who sets up a series 
of experiments in a configurations file. It should be noted that SKaMPI is not 
intended as a correctness or stress test of MPI. 

Often benchmarks for parallel computers are application kernels (see e.g. P 
El). Such benchmarks capture the performance of a machine in a more real- 
istic manner than peak performance figures, but application benchmarks can 
only indirectly guide the development of efficient, portable programs. A widely 
used MPI benchmark, which is in many respects similar to SKaMPI, is mpptest, 
which shipped with the mpich implementation of MPI PP. It measures (nearly) 
all MPI operations, but is less configurable than SKaMPI. A database of imple- 
mentation/machine results gathered with mpptest is not maintained. 

2 The SKaMPI Benchmark 

The goal of SKaMPI is to collect performance data of different MPI implemen- 
tations on different parallel platforms to guide: (1) the optimization of parallel 
programs in early stages of development, (2) the development of portable pro- 
grams with good performance on different platforms, and (3) the optimization of 
a given MPI implementation on a given platform. With this data the developer 
can take different design alternatives into account already in the design phase to 
choose the optimum with respect to the considered target platforms. To make 
performance data available also to developers without access to a specific target 
platform the data gained with SKaMPI can be submitted to the SKaMPI per- 
formance database: http://wwwipd.ira.uka.de/~skampi in Karlsruhe. 

Problems of benchmarking parallel computers and MPI in particular were 
recently discussed in 00 To provide a reliable, reproducible evaluation of the 
performance of a given MPI implementation on a given machine, SKaMPI makes 
use of mechanisms for: 

automatic control of the standard error: single measurements are repea- 
ted until the standard error drops below a user defined threshold (or a maxi- 
mum number of repetitions is reached). Outliers are discarded, and the mean 
of the results is taken. 

automatic parameter refinement: the arguments where to measure (e.g., 
the message length) are computed in dependency of previous measurements. 
This makes it possible to quickly and automatically focus on interesting 
performance features, without using a too finely grained, uniform scale. 
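
The first mechanism, automatic control of the standard error, can be sketched as follows. This is a minimal illustration and not SKaMPI's actual code: the function name `measure_until_stable`, the relative threshold, and the repetition limits are assumptions, and the outlier removal mentioned above is omitted for brevity.

```python
import statistics

def measure_until_stable(sample, max_rel_stderr=0.05, min_reps=4, max_reps=64):
    """Repeat sample() until the standard error of the mean, relative to the
    mean, drops below max_rel_stderr or max_reps repetitions are reached."""
    results = []
    while len(results) < max_reps:
        results.append(sample())
        if len(results) < min_reps:
            continue  # too few values for a meaningful error estimate
        mean = statistics.mean(results)
        stderr = statistics.stdev(results) / len(results) ** 0.5
        if mean > 0 and stderr / mean <= max_rel_stderr:
            break
    return statistics.mean(results), len(results)

# Hypothetical timings standing in for repeated runs of one measurement.
readings = iter([10.0, 10.2, 9.8, 10.1, 10.0, 10.3])
mean, reps = measure_until_stable(lambda: next(readings))
```

Stable measurements stop after a few repetitions, while noisy ones are repeated up to the cap, so running time is spent where the uncertainty is.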
The purpose of these mechanisms is to spend the running time at the “interesting”
measurements, e.g. switching points of algorithms, and measurements
disturbed by the external environment. As mentioned, SKaMPI is controlled 
by a customizable configuration file, in which the user sets up his measurement 
suite. SKaMPI also includes a report generator, which presents the output in 
a human-readable form. These reports can also contain comparisons between
different measurement suites. Ideally, new measurements are reported to the 
SKaMPI database in Karlsruhe. The figures in Section 0 were all generated 
automatically from the database. 
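
The automatic parameter refinement described in this section can be pictured with the sketch below. It is one plausible scheme under stated assumptions, not SKaMPI's actual algorithm: the function name `refine`, the bisection strategy, and the tolerance are hypothetical. An interval of message lengths is bisected only when the measurement at its midpoint deviates from the linear interpolation of its endpoints.

```python
def refine(measure, lo, hi, tol=0.1, max_points=32):
    """Choose message lengths adaptively: bisect any interval whose midpoint
    measurement differs from linear interpolation by more than tol (relative)."""
    points = {lo: measure(lo), hi: measure(hi)}
    stack = [(lo, hi)]
    while stack and len(points) < max_points:
        a, b = stack.pop()
        mid = (a + b) // 2
        if mid in (a, b):
            continue  # interval cannot be split any further
        points[mid] = measure(mid)
        # Value expected at mid if the curve were a straight line from a to b.
        predicted = points[a] + (points[b] - points[a]) * (mid - a) / (b - a)
        if abs(points[mid] - predicted) > tol * abs(points[mid]):
            stack.append((a, mid))
            stack.append((mid, b))
    return sorted(points.items())

# Hypothetical cost curve with a protocol switch at a message length of 100.
curve = refine(lambda n: 1.0 if n < 100 else 2.0, 0, 1024)
```

On a curve with a sharp switching point the sampled lengths cluster around the switch, whereas a perfectly linear curve is measured only at its endpoints and one midpoint.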

3 The Derived Datatype Benchmark 

We were faced with two alternatives for how to incorporate derived datatypes 
into SKaMPI. One was to define a reasonable selection of “typical” instances, 
with the danger of missing out this or that aspect which is important for 
some particular MPI implementation or user. The other was to give the user 
of SKaMPI the freedom to define the derived datatypes to be measured, with 
the danger of putting so much burden of definition on the user that the facil- 
ity will not be used. We opted for the first alternative, which fits well with the 
overall concept of SKaMPI. 

The test datatypes are synthetic, but intended to capture typical usage of 
derived datatypes. All derived datatype constructors of MPI are covered in pat- 
terns and combinations that reflect common usage. The suite is completed with 
instances to probe for special optimizations that might be incorporated in a 
specific MPI implementation. 

A measurement with derived datatypes is described in the configurations 
file by defining a base type over which the selected send type and receive type 
are constructed. All communication in the measurement is then done using the 
receive and send types thus constructed. Each derived datatype specifies the 
same amount of base type units, so receive and send type can be chosen in- 
dependently of each other and will always match. The base type is the unit 
of communication, and can be any of the MPI predefined types. In addition,
the user can define an MPI structured type to be used as base type. This
is done by supplying a list of triples of counts, offsets and predefined types,
$(c_1, o_1, t_1), (c_2, o_2, t_2), \ldots, (c_k, o_k, t_k)$. A structure with $k$ blocks is constructed
with block $i$ consisting of $c_i$ units of type $t_i$ starting at offset $o_i$. Each count
must be $> 0$, but negative offsets are allowed. This base type makes it possible
to test more complicated instances of nested datatypes. 
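
As an illustration of the triple notation, the following Python sketch computes the number of bytes and the byte range spanned by a structured base type given as a list of (count, offset, type) triples. The function name `struct_layout` and the sizes in `TYPE_SIZE` are assumptions made for this example (real sizes are platform dependent), and alignment padding, which an MPI implementation may add to the extent, is ignored.

```python
# Assumed sizes in bytes of a few MPI predefined types (platform dependent).
TYPE_SIZE = {"MPI_CHAR": 1, "MPI_INT": 4, "MPI_DOUBLE": 8}


def struct_layout(triples):
    """Return (size, lower_bound, upper_bound) in bytes for a list of
    (count, offset, type_name) triples describing a structured base type."""
    if not triples:
        raise ValueError("at least one block is required")
    size = 0
    lower = None
    upper = None
    for count, offset, type_name in triples:
        if count <= 0:
            raise ValueError("each count must be > 0")
        nbytes = count * TYPE_SIZE[type_name]
        size += nbytes
        # Offsets may be negative, so track both ends of the spanned range.
        lower = offset if lower is None else min(lower, offset)
        upper = offset + nbytes if upper is None else max(upper, offset + nbytes)
    return size, lower, upper


# Three blocks: 2 ints at offset 0, 1 double at offset 16, 3 chars at offset -4.
example = [(2, 0, "MPI_INT"), (1, 16, "MPI_DOUBLE"), (3, -4, "MPI_CHAR")]
layout = struct_layout(example)
```

For the example triples the structure carries 19 bytes of data spread over the byte range from -4 to 24.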

We classify the SKaMPI predefined datatype patterns into fixed derived types, 
dynamic derived types, nested derived types, and special derived types. All of these 
patterns can be freely combined with the SKaMPI communication measurement 
patterns. 

To measure the performance of “computing collectives” like MPI_Reduce on 
derived datatypes an operator must be defined for each derived datatype. This 
is a delicate issue, since the time spent in performing the operation is counted 
in the total time of the reduction. How time-consuming is the “typical”
user-defined operation? How well can the “typical” user-defined operation exploit
the given machine, e.g. does it have loops that can be vectorized? Letting the
SKaMPI user fully control the operations to be performed is fair for the
individual user/machine, but tiresome for the user, and makes comparison among
different machines and implementations difficult. We compromise by taking the
“copy first argument” function (also known as MPI_REPLACE in MPI-2), $f(a, b) = a$, as
the operation for derived datatypes. The operation is implemented by means of
MPI_Pack and MPI_Unpack. The first (input) argument is packed into an
intermediate buffer, and unpacked to the output argument. This solution is
general-purpose, fair to the MPI implementations, and does not require user intervention.
There is no significant disadvantage to vector machines; packing and unpacking
can be efficiently vectorized as shown in [Oj. 
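
The pack/unpack idea behind the “copy first argument” operation can be sketched in a few lines of Python. This is only an analogy, not SKaMPI code: the standard `struct` module stands in for MPI_Pack and MPI_Unpack, and the function name and the format string argument are assumptions of this example.

```python
import struct


def copy_first_argument(invec, inoutvec, fmt):
    """Apply f(a, b) = a by packing `invec` into an intermediate buffer
    and unpacking that buffer into `inoutvec` (modified in place)."""
    buffer = struct.pack(fmt, *invec)          # plays the role of MPI_Pack
    inoutvec[:] = struct.unpack(fmt, buffer)   # plays the role of MPI_Unpack
    return inoutvec


a = [1, 2, 3]
b = [9, 9, 9]
copy_first_argument(a, b, "3i")   # b now holds the values of a
```

Because the operation only moves data, its cost is dominated by the packing and unpacking of the datatype, which is what the benchmark intends to expose.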

We now describe the predefined derived datatype patterns, assuming the 
reader is familiar with the MPI standard terminology for derived datatypes |E|. 
The parameters BLOCKS, BLOCKSIZE, and VECTORSTRIDE control the data layout, 
and are set by the user in the configurations file. A length-parameter $\ell$ is varied
by SKaMPI. The number of base type units communicated for each derived type
pattern is $\ell \times$ BLOCKS $\times$ BLOCKSIZE.

3.1 Fixed Derived Datatypes 

A fixed derived datatype describes a memory layout consisting of a fixed number
of units of the base type, independently of the communication volume of a
measurement. A fixed datatype is defined for each of the type constructors of MPI-1.
The new array types introduced in the MPI-2 standard can also be measured, but
we will not comment on the MPI-2 types here.

1. A fixed contiguous type with BLOCKS*BLOCKSIZE base type units. 

2. A fixed vector or MPI hvector consisting of BLOCKS blocks, each of BLOCKSIZE 
base type units and with stride VECTORSTRIDE. 

3. A fixed indexed or structured type consisting of BLOCKS blocks.
Even-numbered blocks consist of BLOCKSIZE base type units, odd-numbered blocks of
BLOCKSIZE-1 units, with the last block being large enough for a total number
of BLOCKS $\times$ BLOCKSIZE units. Blocks are spaced one base type unit apart to
keep the resulting type non-contiguous.
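
One possible reading of item 3 is sketched below in Python. It assumes that blocks are numbered from 0 and that "spaced one base type unit apart" means a gap of one unit between consecutive blocks; the helper name is hypothetical. The two returned lists correspond to the block-length and displacement arrays that an indexed type constructor such as MPI_Type_indexed expects.

```python
def fixed_indexed_layout(blocks, blocksize):
    """Return (block_lengths, displacements) in base type units."""
    # Even-numbered blocks get BLOCKSIZE units, odd-numbered ones BLOCKSIZE-1.
    lengths = [blocksize if i % 2 == 0 else blocksize - 1 for i in range(blocks)]
    # Enlarge the last block so that the total is BLOCKS * BLOCKSIZE units.
    lengths[-1] += blocks * blocksize - sum(lengths)
    displacements = []
    position = 0
    for length in lengths:
        displacements.append(position)
        position += length + 1   # leave a gap of one base type unit
    return lengths, displacements


lengths, displacements = fixed_indexed_layout(10, 7)
```

With BLOCKS = 10 and BLOCKSIZE = 7 the blocks have lengths 7, 6, 7, 6, 7, 6, 7, 6, 7, 11, which add up to 70 units.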

3.2 Dynamic Derived Datatypes 

Instead of sending batches of fixed types, we also measure the performance when 
sending only one instance of a type. The right amount of data to be sent is 
controlled by having the length-parameter $\ell$ be part of the datatype. Three
dynamic vectors/structures, where $\ell$ appears in different positions, are defined:

1. vector/struct with BLOCKS blocks, each of $\ell \times$ BLOCKSIZE elements

2. vector/struct with $\ell$ blocks, each of BLOCKS $\times$ BLOCKSIZE elements

3. vector/struct with $\ell \times$ BLOCKS blocks, each of BLOCKSIZE elements

For vectors the stride is the proper multiple of VECTORSTRIDE, allowing for
non-contiguous as well as contiguous vectors. For structs the blocks are spaced
one base type unit apart such that the resulting type does not specify a
contiguous segment of memory.
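
The three dynamic shapes differ only in where the length-parameter enters. The short Python sketch below (function name hypothetical) lists the number of blocks and the block length for each variant, which makes it easy to check that all three describe the same number of base type units.

```python
def dynamic_shapes(length, blocks, blocksize):
    """Return (number_of_blocks, block_length) for the three dynamic types."""
    return [
        (blocks, length * blocksize),      # 1. length scales the block size
        (length, blocks * blocksize),      # 2. length is the number of blocks
        (length * blocks, blocksize),      # 3. length scales the block count
    ]


shapes = dynamic_shapes(4, 10, 7)
units = [count * size for count, size in shapes]
```

For a length-parameter of 4 with BLOCKS = 10 and BLOCKSIZE = 7, every variant communicates 280 base type units.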

3.3 Nested Derived Datatypes 

We have defined both static and dynamic nested derived datatypes, in both cases 
with two levels of nesting. There are four static nested types: a vector/indexed 
type of BLOCKS blocks, each consisting of a vector/indexed type of BLOCKSIZE 
blocks, each of length one and stride two. Three dynamic nested vectors are 
defined, depending on where the length-parameter appears: 

1. BLOCKS vectors, each a vector of BLOCKSIZE blocks of size $\ell$

2. BLOCKS vectors, each a vector of $\ell$ blocks of size BLOCKSIZE

3. $\ell$ vectors, each a vector of BLOCKS blocks of size BLOCKSIZE

The user can add one extra nesting by setting up a structured base type in 
the configurations file. 

3.4 Special Derived Datatypes 

To be able to test for special optimizations in an MPI implementation, like 
detection of larger consecutive segments, we have defined a special (fixed) MPI
struct, which by means of an overlapping vector, an indexed type and a struct spans
a consecutive memory segment. An MPI implementation may detect this and
treat the consecutive segment as such.



3.5 Process Local Handling of Datatypes 

SKaMPI provides so-called simple patterns for measuring MPI operations with
local completion semantics. We have extended the benchmark with simple
patterns for measuring the costs of defining and committing the derived datatypes
discussed above.

4 Example Measurements 

We illustrate the use of the SKaMPI datatype benchmark with three different 
platforms: Cray T3E/900, IBM RS 6000 SP, and NEC SX-5. We will not go into 
the characteristics of these machines here, but comment only on what can be 
observed from the benchmarks. All three machines run the vendor MPI. Due to
space limitations we only show the results obtained with the SKaMPI
send-receive ping-pong pattern, varying over message length. The base type is
MPI_INT, and the parameters BLOCKS, BLOCKSIZE and VECTORSTRIDE are set
to 10, 7, and 11, respectively. The times reported in the figures are the total
time of an MPI send and an MPI receive operation.

Figures 1, 2 and 3 show the performance of the ping-pong measurement with
all fixed derived types and the special, contiguous type, compared to the
performance with contiguous MPI_INT data. None of the MPI implementations detects
that the special type spans a contiguous memory segment, but all
implementations treat the contiguous derived datatype as a contiguous segment (as they
should), obtaining performance similar to non-structured MPI_INT. The NEC SX-5
MPI implementation treats all types roughly equally, with a factor 2 to 4 penalty
over contiguous data. For the Cray T3E the MPI hvector interestingly seems
to behave better than the vector. There is a considerable penalty for the complex
(nested), special type, up to a factor of 16. The IBM SP also has a considerable
overhead for the special type (about a factor of 8).




Fig. 1. Cray T3E: Fixed and special types, ping-pong pattern. Message lengths 
are in bytes, and the times are the time for a send and a receive operation. 



Fig. 2. IBM RS 6000 SP: Fixed and special types, ping-pong pattern. Message
lengths are in bytes, and the times are the time for a send and a receive operation.

We also investigated the performance of the dynamic structured types with
the reduce pattern (measurement of MPI_Reduce). For all three machines, there
is a noticeable, non-constant overhead for the structured types where the
number of blocks is proportional to the length-parameter $\ell$. The structure with a
fixed number of blocks (and blocksize proportional to $\ell$) is handled well by all
machines, with performance close to that of contiguous MPI_INT data. Finally
we studied the performance of the nested, dynamic vectors with the ping-pong
communication pattern. Here the Cray T3E shows a considerable overhead of a
factor up to 8 for the types where the number of vectors at the outermost level
is proportional to $\ell$, but handles the other extreme very well with performance
close to that of contiguous data. The IBM SP also handles the vector where
the innermost blocksize is proportional to $\ell$ well, with an overhead of a factor
of about 2 for the other cases. The NEC SX-5 handles all cases similarly, with an
overhead of a factor of about 2.

Abstract. Four different benchmarking suites for testing the performance
of MPI routines have been considered on an SGI Origin2000.
Special properties of these benchmarking suites, which are mostly hidden
from the user, turned out to have considerable influence on the
benchmarking results for ccNUMA systems, such as the number and location of
buffers, the warm-up of the cache before running the benchmark, or the
procedure for measuring the time. In addition, we consider the interpretation of
results and their approximation by piecewise linear curves.

Key Words: message passing, performance analysis, benchmarking, 
MPI, ccNUMA architectures. 



1 Introduction 

There are several possibilities for measuring performance characteristics of MPI
message passing routines by the use of existing benchmarking suites. Four of
them have been used to obtain reliable performance parameters of an SGI
Origin2000: mpptest P, SKaMPI P, MPBench |S], and PMB p. At first glance, all
codes were able to deliver a good overview and similar results. A detailed
evaluation, however, showed differences among measurements which could not be
accepted without analyzing the reasons.

In the present paper, we investigate unexpected properties of the bench- 
marking suites. We study effects observed on running a single round trip with 
MPI_Send/MPI_Recv. All performance figures are, therefore, related to a complete 
round trip. We mainly consider effects caused by the cache structure and param- 
eters. We leave open the question in which case the results may be of interest for 
performance evaluation of user programs or what else should be benchmarked 
for this purpose. For recent numerical results of a great variety of MPI routines, 
we refer to other publications like 0. 

The considered computer architecture is a 4-processor Origin2000, which is a
ccNUMA system with R10000 processors and IRIX 6.5 as operating system. For
MPI the native SGI implementation release 3.1.1.0 with default parameters was
used (i.e. in particular: 16 MPI buffers of 16KB each per process and in addition
16 buffers per host). The considered machine runs at 195 MHz. Each processor
has an L1 cache for data only and an L2 cache for data and instructions. The
L1 cache is a 2-way set associative LRU write back cache of 32 KB with a cache
line of 32 bytes. Because of 16KB pages one layer corresponds to a page. The
L2 cache is a 2-way set associative LRU write back cache of 4 MB with a cache
line of 128 bytes. Memory access needs about 2-3 cycles for the L1 cache, 10
cycles for the L2 cache, and 75-300 cycles for the main memory. All processors
are interconnected via a network such that each processor is able to access each
part of the global memory. For more details we refer to j2j.

Fig. 1. Overview on Performance values for all message sizes.

Message transmission is always a memory transfer in this system. For
benchmarking we only considered pairs of processors residing on different boards (mode
ppml). The SGI implementation of MPI always buffers messages. If
we use the letters u, m, s, r, b, and b' for user, MPI, send, receive, buffer in the
master process, and buffer in the slave process, respectively, then a round trip (ping-pong)
benchmark, except for very small messages, follows the scheme



$$ b_{us} \to b_{ms} \to \{\, b'_{mr} \to b'_{ur},\; b'_{us} \to b'_{ms} \,\} \to b_{mr} \to b_{ur} \qquad (1) $$



We mainly discuss activities of the master process. The activities enclosed by 
braces are executed by the slave process. A benchmarking routine for message 
passing routines normally works like the following program: 

while not_all_sizes_considered do
    select_next_message_size;
    while further_measurement_is_requested do
        initialize_timer;
        for number_of_repetitions do
            execute_test_program;
        enddo;
        save_timing_values;
    enddo;
enddo;
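
For readers who prefer runnable code, the same loop structure might look as follows in Python. This is a sketch, not code from any of the four suites: `run_benchmark` and its parameters are hypothetical, the test program is any callable taking a message size, and the condition "further measurement is requested" is simplified to a fixed number of measurements per size.

```python
import time


def run_benchmark(message_sizes, measurements, repetitions, test_program):
    """Return {message_size: [mean time per repetition, ...]}."""
    results = {}
    for size in message_sizes:                 # select next message size
        timings = []
        for _ in range(measurements):          # one timing value per pass
            start = time.perf_counter()        # initialize timer
            for _ in range(repetitions):
                test_program(size)             # execute test program
            elapsed = time.perf_counter() - start
            timings.append(elapsed / repetitions)   # save timing value
        results[size] = timings
    return results


calls = []
timing_table = run_benchmark([8, 64], measurements=3, repetitions=5,
                             test_program=calls.append)
```

Each stored value is an average over the inner repetition loop, so a single disturbed repetition is smoothed out but not removed.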

Table 1 gives a rough impression of the possibilities provided by the
considered benchmarks.

                    mpptest               MPBench               PMB                   SKaMPI

message sizes       explicit list or      mixed geometric/      predefined            arithmetic sequence
                    automatic selection   arithmetic sequence                         and automatic selection

measurements        until standard        user defined          1                     between user limits
per size (m)        deviation below                                                   until std deviation
                    error limit                                                       below error limit

repetitions         user defined          user defined          predefined            1

results,            1st to last proc,     1st to last proc,     1st to 2nd proc,      max from 1st to
μs/transaction      min_m(time),          transactions,         min_procs(time),      any proc in use,
or                  max_m(time),          values for all        max_procs(time),      av_m(time)
transactions/s      av_m(time)            measurements          av_procs(time)

buffers             2                     2                     2                     1

buffer address      different, 5920       not considered        not considered        different, ≈3400 for
in page             for small msgs.                                                   small messages

Table 1. Controlling the run of benchmarking Send/Receive and form of results.



2 Overview and Break Points of Runtime Curves 

Without considering break points caused by the transfer protocol or by exceeding
machine parameters, a precise result cannot be expected for all message sizes
from 0 to 16 MB. Figure 1 shows the results of all considered benchmarking tools
for a round trip with MPI_Send/MPI_Recv. About 125 message sizes have been used
with increasing distance. MPBench offers the most convenient way to define
the arguments for this first step. The user can start with a sequence $2^i$ (in our
case $i = 0, \ldots, 24$) and can request a certain number of intermediate arguments
for all intervals of this sequence. The other tools offer less convenient ways to
produce an appropriate list. The automatic selection of arguments turned out
not to be useful for this wide range of sizes.

Figure 1 shows that all benchmarking tools deliver similar results. Here we see
that MPBench is highly sensitive to unavoidable disturbances, in particular for
small messages. We see break points at $2^6$, $2^{10}$, $2^{14}$, and $2^{18}$ bytes. $2^{10}$ and $2^{18}$ mark
breaks in the transmission protocol and we use these break points to separate
intervals of message sizes for discussion in greater detail.
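
A list of message sizes of the kind described above, powers of two with a few evenly spaced arguments inside each interval, can be generated as in the following Python sketch. The function name and the choice of four intermediate arguments per interval are assumptions of this example; the exact scheme used by MPBench is not given in the text.

```python
def mixed_sizes(max_exponent, intermediate):
    """Powers of two 2**0 .. 2**max_exponent plus `intermediate` evenly
    spaced integer sizes inside every interval [2**i, 2**(i+1))."""
    sizes = set()
    for i in range(max_exponent):
        low, high = 2 ** i, 2 ** (i + 1)
        for j in range(intermediate + 1):
            # Integer arithmetic; duplicates in tiny intervals collapse in the set.
            sizes.add(low + j * (high - low) // (intermediate + 1))
    sizes.add(2 ** max_exponent)
    return sorted(sizes)


sizes = mixed_sizes(24, 4)
```

The sequence is geometric between intervals and arithmetic within them, so the distance between neighboring arguments grows with the message size.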

3 Small Messages 

Messages below 64 bytes and messages between 64 and 1024 bytes can be studied
together. Messages up to 64 bytes are sent immediately. Up to 1024 bytes the
remainder is sent afterwards following a homogeneous protocol. The left part of
Figure 2 shows measurements with the four benchmark suites. Values of mpptest
(minimum values as recommended in d) show the clearest timing figures. As
PMB and SKaMPI deliver average values and the measurements are not
completely free of disturbances, these benchmarks show slightly higher values. It was
a general observation that MPBench is rather sensitive to disturbing
activities on the system. Moreover, unlike the other tools, MPBench and mpptest
perform no cache-warming run. Therefore, it is necessary to use a high
repetition number (> 50) to get reliable values using these benchmarking suites.
Although we selected minimum values of measurements for each message size also
for MPBench, this tool showed the highest values. We have no explanation for
this effect. We only observed that the timer is different.

Fig. 2. Performance values for round trip of small message sizes.

For the left part of Figure 2 we considered a list of arguments as dense as
possible, so that the dots alone already provide a complete curve. In the right
part, we considered arguments selected by mpptest automatically. The tool had
some problems distributing its activities evenly over the whole range, and the
break points are not always clearly defined. Therefore, it is useful to apply an
intelligent approximation algorithm for producing a piecewise linear curve. Our
algorithm allows one to specify a maximum deviation $\varepsilon$ of values from the curve and
to suppress a small number $f$ of subsequent measurements if they are considered
erroneous. It executes the following steps:

1. Group values starting at the end of the set into subsets representing linear
curves (deviation < $\varepsilon$). Start with 3 points on a straight line and extend the
line to smaller arguments as long as no more than $f$ subsequent arguments
have to be left out. Repeat this recursively until all points are considered.

2. If there is a non-monotonic part of the curve consisting of a sequence of linear
pieces each of which contains two arguments only, try to form a monotonic
curve in the following way: suppress the first and the last argument and
group the points again in pairs. Remove all single linear pieces showing the
wrong direction from a global point of view.

3. Finally try to integrate all single arguments lying between two subsequent
linear pieces into one of the neighboring pieces.

4. Define break points in the middle of two subsequent linear pieces.
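
The grouping in step 1 can be sketched in Python as follows. This is a simplified illustration and not the authors' implementation: it covers step 1 only, starts each piece from a single point rather than from three points on a straight line, and determines the line by a least-squares fit, which is an assumption since the text does not say how the line is chosen. All names are hypothetical; `eps` is the maximum deviation and `f` the number of subsequent measurements that may be suppressed.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def fits(points, eps):
    """True if every point deviates from the fitted line by less than eps."""
    slope, intercept = fit_line(points)
    return all(abs(y - (slope * x + intercept)) < eps for x, y in points)


def group_linear_pieces(points, eps, f):
    """Group points into linear pieces, working from large to small arguments."""
    pts = sorted(points)
    pieces = []
    hi = len(pts) - 1
    while hi >= 0:
        piece = [pts[hi]]
        next_start = hi - 1        # where the next piece will begin
        misses = 0                 # subsequent arguments left out so far
        i = hi - 1
        while i >= 0 and misses <= f:
            trial = piece + [pts[i]]
            if fits(trial, eps):
                # Points skipped before this one are treated as erroneous.
                piece, misses, next_start = trial, 0, i - 1
            else:
                misses += 1
            i -= 1
        pieces.append(piece[::-1])
        hi = next_start            # trailing misses go to the next piece
    return pieces[::-1]


# Two linear regimes with one disturbed measurement at x = 8.
data = [(x, x) for x in range(1, 5)] + [(x, 3 * x - 10) for x in range(5, 11)]
data[data.index((8, 14))] = (8, 30)
pieces = group_linear_pieces(data, eps=0.1, f=1)
```

In the example the disturbed value at x = 8 is left out, and the remaining points fall into one piece for x = 1..4 and one for x = 5..10.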

The results of this algorithm can be seen in the right part of Figure 2 for $f = 2$
and different values of the deviation. The algorithm is considered a first step,
working mostly locally on the set of points. Global operations could improve the
result considerably, but this has not yet been done. Such an approximation could
be used to decide automatically which measurements should be removed
and where break points of the timing curve can be assumed.

Fig. 3. Performance values for round trip of medium message sizes.



4 Messages of Medium Size and the L1 Cache

Measured values of medium sized messages are summarized in Figure 3. Except
for a few mismeasurements the values differ systematically to a certain extent. In
particular the values of mpptest jump up and down. It turned out that the number
and location of user buffers explain this behavior. mpptest uses 2 subsequent
buffers, the location of which changes in a complicated way with the message
size. Therefore, we developed a test routine which uses a send buffer starting
at a well defined offset relative to the beginning of a page and a receive buffer
immediately behind the send buffer. Figure 4 shows the results for various offsets.
To demonstrate the effects more clearly, we used a different scale for each curve
by adding offset/40 to the time values.

Considering Figure 4 we can observe two different effects: first, there is always
a break at 8KB, i.e. the curve is a little steeper above 8KB, and second, there
are jumps after which the curve continues at a higher level for a while.

The break at a size of 8KB is a clear L1 cache effect. The transmission speed
depends on the number of cache misses. In the case of critical overlays of buffers
in a cache (more than 2 different cache lines per set), there are cache misses for
each access in this particular region because no data can survive cycle (1) in the
cache. In the L1 cache, the 2 user buffers ($b_{us}$ and $b_{ur}$, see (1)) already occupy one
line in each set and both lines in an increasing number of sets beyond a size of
8KB. The same is true for the two MPI buffers. Therefore, there is an increasing
range of critical overlay.
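
The occupancy argument can be checked with a small Python model of the L1 cache described in the introduction (32 KB, 2-way set associative, 32-byte lines, hence 512 sets). The sketch assumes two user buffers stored back to back, starting at a line-aligned offset, and a set index taken directly from the address; the MPI buffers are left out, and the function names are hypothetical.

```python
from collections import Counter

LINE = 32                           # L1 cache line size in bytes
WAYS = 2                            # 2-way set associative
CACHE = 32 * 1024                   # L1 cache size in bytes
SETS = CACHE // (LINE * WAYS)       # 512 sets


def lines_per_set(start, nbytes):
    """Count how many distinct cache lines of [start, start+nbytes) map to each set."""
    first = start // LINE
    last = (start + nbytes - 1) // LINE
    return Counter(line % SETS for line in range(first, last + 1))


def sets_with_two_lines(message_size, offset=0):
    """Number of sets in which the two adjacent user buffers fill both ways."""
    occupancy = lines_per_set(offset, 2 * message_size)
    return sum(1 for count in occupancy.values() if count >= WAYS)


full_sets_at_8k = sets_with_two_lines(8 * 1024)
full_sets_at_10k = sets_with_two_lines(10 * 1024)
```

Up to 8KB per buffer no set is filled by the user buffers alone; at 10KB a quarter of the sets (128 of 512) already hold two user-buffer lines, so any MPI buffer line mapping to those sets forces an eviction.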

Fig. 4. Shape of time figures for round trip depending on the offset of user
buffers.

Fig. 5. Shape of jumps in time figures for round trip using offset 4160.

If the location of the second MPI buffer changes after the size of the first
buffer has exceeded a certain threshold, the new location might cause a critical
overlay in the L1 cache for a block of data which was not present before (see
Figure 6). Thresholds of this kind have been discovered at 2KB, 4KB and 8KB. We
indirectly observed locations of 2KB+$\varepsilon$, 4KB+$\varepsilon$, 8KB+$\varepsilon$, and 0KB+$\varepsilon$ relative
to the start of a page for $b_{mr}$ and $b'_{ms}$. The value of $\varepsilon$ was around 400 bytes. The
size of the critical overlay depends on the location of the user buffers.

If there are 2 user buffers and $b_{ur}$ is located directly behind $b_{us}$, then the
starting address of $b_{ur}$ moves ahead with the size of $b_{us}$. But the MPI buffer
$b_{mr}$ is a little ahead of $b'_{ms}$. Therefore, the transfer $b'_{ms} \to b_{mr}$ unfortunately
leaves $b_{mr}$ in the status least recently used in the L1 cache for the majority
of sets. If the starting address of $b_{ur}$ is also ahead of $b_{mr}$, the store operations
of $b_{mr} \to b_{ur}$ destroy $b_{mr}$ before it can be read. This effect leads to the
jumps near message sizes 14776, 12776, 10776, and 8776 bytes within the curves
for 2000, 4000, 6000, or 8000 bytes offset for $b_{us}$. If $b_{mr}$ jumps away above a
certain message size this effect will disappear again.

In order to better understand the special form of this second kind of jump,
we consider a finer resolution in Figure 5. While the upper curve $t = f(s)$ in
Figure 5 shows the time $t$ for certain offsets of $b_{us}$ and message sizes $s$ increasing
in steps of 4 bytes, the lower curve shows $t' = f(s) - f(s + 128)$. We see that the
disturbance is periodic with 128 bytes in size and the first period is overlaid by
a special effect. This effect is caused if the normal write back functionality of
the L1 and L2 caches for $b_{ur}$ takes place together with reading $b_{mr}$ from the L2
cache after L1 cache misses in the same set of cache lines. The question has to
be left open here why the considered effect takes place only if $b_{ur}$ starts at
a double word address (i.e. the address is 0 (mod 8)) or for the first 3 L1 cache
lines which lie over an L2 cache line.

Fig. 6. Overlay in L1 cache in the case of two user buffers and 12k offset

If we use $j$ for the message size where a jump of the second kind takes place and
$a$ for the offset of $b_{us}$, then we observe $a + j = 16776$ for message sizes above 8KB
and offsets below 8KB, which is 392 (mod 16384). Therefore, we can assume a
starting address of 384 (mod 16384) for $b_{mr}$. Similar series of jumps can be found
for message sizes above and offsets below 4KB or 2KB.
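
The arithmetic can be verified directly with the jump positions quoted above for the four offsets. In the Python sketch below the dictionary simply restates those observations, and the helper name is hypothetical; since the receive buffer lies directly behind the send buffer, offset plus message size is the page-relative start of the receive buffer.

```python
PAGE = 16 * 1024   # 16KB pages, i.e. addresses are taken mod 16384

# offset of the send buffer in bytes -> message size at which the jump occurs
observed_jumps = {2000: 14776, 4000: 12776, 6000: 10776, 8000: 8776}


def receive_buffer_start(offset, message_size):
    """Page-relative start address of a receive buffer placed directly
    behind a send buffer of `message_size` bytes starting at `offset`."""
    return (offset + message_size) % PAGE


starts = {a: receive_buffer_start(a, j) for a, j in observed_jumps.items()}
```

All four observations give the same page-relative address, 392, which lies just above the assumed start of the MPI receive buffer at 384.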

Figure 6 shows the overlay of buffers in the L1 cache. This is a hypothetical
situation which shows the principles for message sizes around 8k bytes. Just at
this point, we observed a jump in the timing curve of mpptest in every case, but
in the SKaMPI results in some cases only. Since the exact location of the MPI buffers
is not completely known, some questions have to be left open.



5 Large Messages and the L2 Cache 

According to Figure 1 the time for a round trip seems to be a simple curve in
the case of messages above 16KB. The bandwidth, however, shows a very
complex behavior. Here we use 2*message_size/time_for_round_trip for bandwidth.
At the beginning, the bandwidth increases with the message size as usual
(see Figure 7). Before saturation can be reached, the TLB is exhausted (the
TLB contains entries for translation of virtual addresses for 64 pages). In the
case of two user buffers, a size of 1/4MB requires $2 \times 16$ pages in addition to the
$2 \times 16$ pages for MPI buffers. If this limit is exceeded, the bandwidth decreases
with message size as more and more pages cannot survive the round trip in
the TLB. Beyond 1/4MB there is an additional break in the transmission protocol,
above which messages are moved between user buffers directly. Close to
2MB the L2 cache is no longer able to hold two user buffers and the bandwidth
decreases until the final level is reached at 4MB, which is the size of the L2 cache.
The SKaMPI results look considerably better between 1/4 and 4MB because
SKaMPI uses one user buffer only.

Fig. 7. Performance values for round trip of large message sizes.
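
The page count behind the TLB limit and the bandwidth definition used here can be written down as a short Python sketch. It assumes page-aligned user buffers and the $2 \times 16$ MPI buffer pages stated in the text; the function names are hypothetical.

```python
PAGE = 16 * 1024            # page size in bytes
TLB_PAGES = 64              # pages covered by the TLB
MPI_BUFFER_PAGES = 2 * 16   # pages used by the MPI buffers


def pages_touched(message_size, user_buffers=2):
    """Pages needed for one round trip with page-aligned user buffers."""
    pages_per_buffer = -(-message_size // PAGE)   # ceiling division
    return user_buffers * pages_per_buffer + MPI_BUFFER_PAGES


def bandwidth(message_size, round_trip_time):
    """Bandwidth as defined in the text: 2 * message_size / time_for_round_trip."""
    return 2 * message_size / round_trip_time


quarter_mb = 256 * 1024
two_buffers = pages_touched(quarter_mb)                   # exactly fills the TLB
one_buffer = pages_touched(quarter_mb, user_buffers=1)    # leaves headroom
```

With two user buffers a message of 1/4MB needs exactly 64 pages, the TLB capacity, whereas a single user buffer needs only 48, which is consistent with the better SKaMPI results in this range.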

6 Concluding Remarks 

The time cost for a round trip with MPI routines depends on the number of user
buffers and on their location relative to page addresses. This is particularly
important for messages of medium size. Because of the considerable influence of the
cache on the speed of message transfer, the user has to decide whether standard
benchmarking suites are useful for his purpose or if he should concentrate on
test routines which suppress this influence.

The selection of the message sizes at which the performance has to be measured is
still a problem. Good approximation of resulting curves is required in order to
identify points of greatest uncertainty in a time function from a global point of
view. We presented a heuristic algorithm for this purpose but the result is not
yet satisfactory.

Selection of reliable values out of a collection of measurements for the same
size is an open problem. The selection procedure has to decide if differing values
represent right and wrong measurements or if they represent different correct
values obtained for various possible cases of execution. Average or minimum
values as used so far are not satisfactory.

Abstract. The total computing capacity of workstations can be harnessed
more efficiently by using a dynamic task allocation system. The
Esprit project Dynamite provides such an automated load balancing system,
through the migration of tasks of a parallel program using PVM.
The Dynamite package is completely transparent, i.e. neither system
(kernel) nor application program modifications are needed. Dynamite
supports migration of tasks using dynamically linked libraries, open files
and both direct and indirect PVM communication. In this paper we
briefly introduce the Dynamite system and subsequently report on a
collection of performance measurements.



1 Introduction 

With the continuing increases in commodity processor and network performance, 
distributed computing on standard PCs and workstations has become attractive 
and feasible. Consequently, the availability of efficient and reliable cluster-man- 
agement software supporting task migration becomes increasingly important. 

Various PVM |Sj variants supporting task migration have been reported, such
as tmPVM [Tg, DAMPVM 0, MPVM (also known as MIST) 0, ChaRM g|
and CoCheck. For MPI, task migration has been studied in Hector 0.

Building on earlier DPVM work by L. Dikken et al. 0, we have developed
Dynamite^. Dynamite 0 attempts to maintain optimal task allocation for
parallel jobs in dynamically changing environments by migrating individual tasks
between nodes. Task migration also makes it possible to free individual nodes,
if necessary, without breaking the computations.

Dynamite supports applications written for PVM 3.3.x, running under
Solaris/UltraSPARC 2.5.1, 2.6, 7 and 8. Moreover, it supports Linux/i386 2.0 and
2.2 (libc5 and glibc 2.0 binaries; glibc 2.1 is not supported at this point).

^ Dynamite is a collaborative project between ESI, the Paderborn Center for Parallel 
Computing, Genias Benelux and the Universiteit van Amsterdam, partly funded 
by the European Union as Esprit project 23499. Of the many people that have 
contributed, we can mention only a few: J. Gehring, A. Streit, J. Clinckemaillie, 
A.H.L. Emmen. 



J. Dongarra et al. (Eds.): EuroPVM/MPI2000, LNCS 1908, pp. 27-^21 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 

The principal advantages of Dynamite are its API-level transparency, its
powerful, dynamic-loader-based checkpoint/migration mechanism and its
support for the migration of both direct and indirect PVM connections. We have
found Dynamite to be very stable. Its modular design greatly facilitates the port
to MPI P3], which is currently underway.

2 Dynamite Overview 




Fig. 1. Dynamite run-time system. An application has to be decomposed into 
several subtasks already. An initial placement is determined by the scheduler. 
When the application is run, the monitor checks the capacity per node. If it is 
decided that the load is unbalanced (above a certain threshold), one or more 
task migrations may be performed to obtain a more optimal load distribution. 



The Dynamite architecture (see Figure 1) is built up from three separate
parts:

1. The load-monitoring subsystem. The load-monitor should leave the compu- 
tation (almost) undisturbed. 

2. The scheduler, which tries to make an optimal allocation. 

3. The task migration software, which allows a process to checkpoint itself and 
to be restarted on a different host. Basically, the checkpoint software makes 
the state of a process persistent at a certain stage. 

Parallel PVM applications consist of a number of processes (tasks) running 
on interconnected nodes constituting a PVM virtual machine. A PVM daemon 
runs on every node and communicates with other daemons using the UDP/IP 
protocol. PVM tasks communicate with each other and with PVM daemons 
using a message-passing protocol. PVM message passing is reliable: messages
cannot be lost, corrupted or duplicated, and they arrive in the order sent.

In Dynamite, a monitor process is started on every node of the PVM virtual 
machine. This monitor communicates with the local PVM daemon and collects 
information on the resource usage and availability, both for the node as a whole 
and individually for every PVM task. The information is forwarded to a central 
scheduler, which makes migration decisions based on the data gathered. PVM
daemons assist in executing these decisions.

For migration, first, the running process must be checkpointed, i.e. its state 
must be consistently captured on the source node. Next, the process is restored 
on the destination node; its execution resumes from the point at which the source 
process was checkpointed. Typically, the original process on the source node is 
terminated. 

Processes that are part of the parallel PVM application present additional
difficulties. Every PVM task has a socket connection with the local PVM daemon.
This connection is used for the indirect routing. PVM tasks can also establish
point-to-point direct TCP/IP communication channels with each other, to
improve the performance. Extra care must be taken when migrating PVM tasks
to ensure that they do not permanently lose the connection with the rest of the
parallel application, and that the PVM message protocol is not violated.

In Dynamite robust mechanisms for address translation, connection flush- 
ing and connection (re-) establishment have been incorporated that have been 
demonstrated to survive thousands of consecutive migrations. 

For a detailed description of the implementation, the reader is referred to |0|. 

3 Performance Measurements 

In order to evaluate Dynamite’s performance, a number of tests have been
conducted. Some of these are concerned with the performance of the components
of the system, such as the modified PVM library. Others attempt to quantify
the performance of the Dynamite system as a whole, in a controlled dynamic
environment.





Fig. 2. Migration performance of DPVM for (a) Linux and (b) Solaris.

Fig. 3. Communication performance in DPVM and PVM for (a) Linux and (b)
Solaris.

3.1 Performance of System Components 

In a system like Dynamite, there are two easily measurable performance factors: 

— the time it takes to migrate a task of a given size, 

— the difference in communication performance compared to standard PVM. 

Experiments have been performed to measure these two factors, both under 
Linux and Solaris. In case of Linux, reserved nodes of a PC-cluster have been 
used, equipped with PentiumPro 200 MHz CPU and 128 MB RAM, running 
kernel version 2.2.12. In case of Solaris, idle UltraSPARC 5/10 workstations 
have been used, equipped with 128 MB RAM and running kernel version 5.6. In 
both cases, 100 Mbps Ethernet was used. In both cases the NFS servers used for 
checkpoint files were shared with other users, which could affect the performance 
to some extent. 

Figure 2 presents the performance of migration in DPVM for various process
sizes. A simple ping-pong type program, communicating once every few seconds
via a direct connection, was migrated; the process size was set with a single
large malloc call. The execution time of each of the four migration stages
(see 0) was measured. In general, it was found that the major part of the
migration time is spent on checkpointing and restoring; the remaining stages
amount to approximately 0.01 - 0.03 s, and hence are not shown. The speed of
checkpointing and restoring is limited by the speed of the shared file system.
On our systems this limit lies at 4-5 MB/s for NFS running over the 100 Mbps
network. It can be observed, however, that the restoring phase under Linux
takes an approximately constant amount of time, while it grows with process
size under Solaris, resulting in migration times about twice as large for
large processes. This is a side effect of differences in the implementation of
malloc between the two systems. For large allocations, Linux creates a new
memory segment (separate from the heap) using mmap, whereas Solaris always
allocates from the heap with sbrk. When restoring, the heap and stack are
restored with read, which forces an immediate data transfer. For the other
segments, however, our implementation takes advantage of mmap, which uses a
more advanced page-on-demand technique. Since the allocated memory region
is not needed to reconnect the task to the PVM daemon, the time it takes to
restart the task is constant under Linux. Clearly, delays may be incurred later,
when the mmapped memory is accessed and loaded.
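
The difference between the two restore strategies can be illustrated with a
minimal sketch. It is not the Dynamite implementation: the temporary file
stands in for a checkpointed segment, and the function names restore_eager
and restore_lazy are hypothetical. The eager variant transfers every byte
before returning, while the mapped variant returns at once and lets the
operating system fetch pages when they are first touched.

```python
import mmap
import os
import tempfile


def restore_eager(path):
    """Read the whole segment: all bytes are transferred before returning."""
    with open(path, "rb") as f:
        return f.read()


def restore_lazy(path):
    """Map the segment: pages are fetched only when they are first accessed."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Hypothetical stand-in for a checkpointed memory segment (1 MiB of data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))
    checkpoint_path = tmp.name

eager_copy = restore_eager(checkpoint_path)
handle, lazy_view = restore_lazy(checkpoint_path)

first_page = lazy_view[:4096]            # the first page is needed only here
same_content = lazy_view[:] == eager_copy  # touching everything loads it all

lazy_view.close()
handle.close()
os.remove(checkpoint_path)
```

Both variants end up with identical data; they differ only in when the
transfer cost is paid, which is the effect described for the Linux case.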

Figure 3 presents a comparison of communication performance between DPVM and
PVM. Both indirect and direct communication performance have been measured.
A ping-pong type program was used, exchanging messages between 1 byte and
100 KB in size. With DPVM, a slowdown is visible in all cases.
It stems from two factors:

— signal (un) blocking on entry and exit from PVM functions (function call 
overhead), 

— an extra header in message fragments (communication overhead). 

The first factor adds a fixed amount of time for every PVM communication 
function call, whereas the second one increases the communication time by a 
constant percentage. For small messages the first factor dominates, since there 
is little communication. An overhead from 25% for direct communication under 
Linux to 4% for indirect communication under Solaris can be observed. While 
particularly the first difference in speed is significant, it must be pointed out 
that it represents a worst case scenario. The overhead percentage is larger for 
direct communication, since the communication is faster while the overhead from 
signal blocking/unblocking stays the same. 

As the messages get larger, the overhead of signal handling becomes less 
significant, and the slowdown goes down to 2-4% for 100KB messages. 
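
The interplay of the two factors can be sketched with a simple cost model.
This is an illustration only: the latency, bandwidth, per-call cost and header
percentage below are hypothetical values chosen for the example, not
measurements from the experiments, and the function names are assumptions.

```python
def pvm_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Baseline message time: start-up latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def dpvm_time(size_bytes, latency_s, bandwidth_bytes_per_s,
              call_cost_s, header_fraction):
    """Baseline time inflated by a fixed per-call cost (signal blocking and
    unblocking) and a constant percentage (extra fragment header)."""
    base = pvm_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return call_cost_s + (1.0 + header_fraction) * base


def relative_overhead(size_bytes, **params):
    """Slowdown of the modified library relative to the baseline."""
    base = pvm_time(size_bytes, params["latency_s"],
                    params["bandwidth_bytes_per_s"])
    return (dpvm_time(size_bytes, **params) - base) / base


# Hypothetical parameters, for illustration only.
params = dict(latency_s=100e-6, bandwidth_bytes_per_s=10e6,
              call_cost_s=25e-6, header_fraction=0.02)

small = relative_overhead(1, **params)        # fixed cost dominates
large = relative_overhead(100_000, **params)  # approaches header_fraction
```

In this model the relative overhead equals call_cost / base + header_fraction,
so it shrinks towards the constant percentage as the message size grows.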

Tests have been made to compare the communication speed in DPVM before
and after the migration, but no noticeable difference was observed (±1%).

3.2 Stability of the System 

Care has been taken to prove the robustness of the environment. Thousands 
of migrations have been performed both under Solaris and Linux, for processes 
ranging in size from light-weight, 2 MB processes to heavy, 50 MB and larger. 
Delays between individual migrations ranged between a fraction of a second 
and several minutes, in order to test for race conditions. Similarly, different 
communication patterns have been tested, including tasks using very small and 
very large messages, using direct and indirect communication, communicating 
point-to-point and using multicasts. These proved to be very revealing tests. 

In one test performed under Solaris, Dynamite was able to make over 2500 
successful migrations of large processes (over 20 MB of memory image size) of a 
commercial PVM application using direct connections. 

3.3 Performance of the Integrated System 

Benchmarks In order to assess the usefulness of the integrated system, reserved
nodes of a cluster have been used to run a series of parallel benchmarks under
several different conditions. The benchmarks in question originate from the NAS
Parallel Benchmarks suite. The individual benchmarks have been adjusted
to use four computation tasks each, running for approximately 30 minutes in an
optimal situation. Where necessary, code has been added to provide intermediate
information on the execution progress of each task.

Eight nodes of a Linux cluster were reserved, each equipped with PentiumPro 
200 CPU and 64 MB RAM, running Linux kernel version 2.0.36. 100 Mbps 
FastEthernet was used as the communication medium. The number of nodes 
exceeds the number of tasks, so this is a sparse decomposition, and consequently 
during the execution of the benchmarks some nodes are idle. The Dynamite 
scheduler works best in such a situation, since it can migrate tasks away from 
overloaded nodes to idle nodes. 





                                          No load   Load,         Load,
                                                    Dynamite      No Dynamite
cg (smallest eigenvalue approximator)     1795      2226 (+24%)   3352 (+87%)
ep (embarrassingly parallel)              1620      1773 (+9%)    1919 (+18%)
ft (discrete Fourier transform)           1859      2237 (+20%)   2693 (+45%)
is (integer sort)                         1511      1758 (+16%)   1688 (+12%)
mg (discrete Poisson problem)             1756      1863 (+6%)    2466 (+40%)

Table 1. Execution times of NAS parallel benchmarks, in seconds.



Table 1 presents the execution times of the NAS parallel benchmarks. The
numbers in the No load column were obtained by running the individual
benchmarks in the ideal situation, when all the nodes were otherwise totally
idle. As expected, the results obtained this way are the best. For the other
two (Load) columns, an external load has been applied. The external load was
generated by running a single computationally intensive process for 5 minutes
on each node used by the benchmark. One node at a time was overloaded in this
way, and the external load program worked in a cycle, going back to the first
node when it was done with the last one. Two kinds of measurements have been
carried out: one with Dynamite running, and one without. In both cases, the
benchmarks ran slower than without external load. However, for all but one of
the benchmarks, the results obtained with Dynamite are significantly better
than those obtained without it, reducing the percentage of slowdown by a
factor of 2 to 6.
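
The slowdown percentages and reduction factors can be recomputed from the
execution times in Table 1. The short sketch below does this; the dictionary
layout and the function name slowdown_percent are illustrative choices, and
cg denotes the smallest eigenvalue approximator benchmark.

```python
# Execution times in seconds from Table 1:
# (no load, load with Dynamite, load without Dynamite)
times = {
    "cg": (1795, 2226, 3352),
    "ep": (1620, 1773, 1919),
    "ft": (1859, 2237, 2693),
    "is": (1511, 1758, 1688),
    "mg": (1756, 1863, 2466),
}


def slowdown_percent(loaded, baseline):
    """Slowdown of a loaded run relative to the unloaded baseline, in %."""
    return 100.0 * (loaded - baseline) / baseline


for name, (no_load, with_dynamite, without_dynamite) in times.items():
    s_with = slowdown_percent(with_dynamite, no_load)
    s_without = slowdown_percent(without_dynamite, no_load)
    print(f"{name}: {s_with:.0f}% with Dynamite, "
          f"{s_without:.0f}% without, ratio {s_without / s_with:.1f}")
```

The ratio in the last column is the factor by which Dynamite reduces the
slowdown; for the is benchmark it falls below 1, since that benchmark runs
slower with Dynamite.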

Figure 4 presents the execution progress of the NAS parallel benchmarks
(due to space restrictions, only 3 of them could be included). In each case, the
data for one of the tasks of the parallel application is shown. The left graph
presents the time spent on executing each individual step (ideally, this should
be a constant); the right graph presents the total time spent so far.

Fig. 4. Execution progress of NAS parallel benchmarks: the time to execute one
step (left) and the total time (right).

Figure 4(a) shows the results for the cg benchmark. This benchmark slows
down by 87% when subjected to external load. Such a significant slowdown
indicates two things. First, a large part of the execution time must be spent
on computation, otherwise the external load would not affect the local task so
significantly. Second, the communication pattern of the benchmark (global
communication) forces other processes to wait for the one lagging behind, with
all the unpleasant consequences for performance.

The results of the ep benchmark, as presented in Figure 4(b), are different.
The computation tasks of the ep benchmark do not communicate with each other
at all, and consequently all of the execution time is spent on computation. In
such a case, external load significantly hampers the performance of the affected
task but, due to the lack of communication, has no influence on other tasks (the
line in the left graph is flat in the area where other tasks of the application
are affected by the external load).

Figure 4(c) shows the execution of the is benchmark, the only one that
performs worse with Dynamite running. The is benchmark is in some ways similar
to ep: both are only slightly affected by the external load, but for different
reasons. In contrast to ep, in is most of the execution time is spent on
communication: tasks communicate frequently and in large volumes. Therefore,
the application progress is limited by the internode communication subsystem,
not by the CPU, so an external load has little influence on the local task, and
an even smaller one on the remote tasks. The migration decisions of the Dynamite
scheduler are not unreasonable, but their gain fails to exceed the migration
cost, which is rather high in this case because of the large process size
(40 MB).

The large process size (30 MB) also affects the result of the ft benchmark, 
where Dynamite reduces the slowdown from 45% to 20%. The reduction would 
have been significantly larger, had the processes to be migrated been smaller. 



Standard Production Code In this test, the scientific application Grail,
a FEM simulation program, has been used as the test application. The
measurements were made on selected nodes of a cluster (see Section 3.1).





No.  Parallel environment     sparse   redund.
1    PVM                      1854     2360
2    DPVM                     1880     2468
3    DPVM + sched.            1914     2520
4    DPVM + load              3286     2947
5    DPVM + sched. + load     2564     3085

Table 2. Execution time of the Grail application, in seconds.



Table 2 presents the results of these tests, obtained using the internal timing
routines of Grail. Each test has been performed a number of times and an average
of the wall clock execution times of the master process (in seconds) has been
taken. The tests can be grouped into two (decomposition) categories:

— sparse — the parallel application consisted of 3 tasks (1 master and 2 slaves) 
running on 4 nodes, 

— redundant — the parallel application consisted of 9 tasks (1 master and 8 
slaves) running on 3 nodes. 

To obtain the best performance, one would typically use a number of nodes
equal to the number of processes of the parallel application. Neither of the
above decompositions does that. In the case of the sparse decomposition, one
node is left idle (PVM chooses to put the group server there, but this one uses
only a minimal fraction of CPU time). Such a decomposition would be wasteful
for standard PVM. In the redundant case, each node runs 3 tasks of the
application (one of the nodes also runs the group server). Although the number
of nodes used when running the two decompositions is different, comparing the
timings makes sense, since 3 nodes are used at any one time in each case.

In the first set of tests presented in Table 2, standard PVM 3.3.11 has been
used as the parallel environment. Not surprisingly, the sparse decomposition
wins over the redundant one, since it has lower communication overhead.

In the second row, PVM has been replaced by DPVM. A slight deterioration
in performance (1.5-4.5%) can be observed. This is mostly the result of the
fact that migration is not allowed while executing some parts of the DPVM
code. These critical sections must be protected, and the overhead stems from
the locking used. Moreover, all messages exchanged by the application processes
have an additional, short (8 byte) DPVM fragment header.

In the test presented in the third row, the complete Dynamite environment 
has been started: in addition to using DPVM, the monitoring and scheduling 
subsystem is running. Because in this case the initial mapping of the application 
processes onto the nodes is optimal, and no external load is applied, no migra- 
tions are actually performed. Therefore, all of the observed slowdown (approx. 
2%) can be interpreted as the monitoring overhead. 

In the fourth set of tests an artificial, external load has been applied by 
running a single, CPU-intensive process for 600 seconds on each node in turn, 
in a cycle. Since the monitoring and scheduling subsystem was not running, no 
migrations could take place. A considerable slowdown can be observed, although 
it is far larger for the sparse decomposition (75%) than for the redundant one 
(19%), actually making the latter faster. This is a result of the UNIX process 
scheduling policies: for sparse decomposition, the external load can lengthen the 
application runtime by a factor of 2, while for the redundant decomposition by 
no more than 33%, since there are already 3 CPU-intensive processes running on 
each node, so the kernel is unlikely to grant more than 25% of CPU time for the 
external load process. This shows that sparse decomposition, although faster in 
a situation close to ideal, performs rather badly when the conditions deteriorate, 
while the redundant decomposition is far less sensitive in this regard. 
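
The factor of 2 and the 33% bound follow from a simple equal-share argument,
sketched below. The assumption, made here for illustration, is that the kernel
divides CPU time equally among all CPU-intensive processes on a node; the
function name is hypothetical.

```python
def worst_case_slowdown(app_tasks_per_node, load_processes=1):
    """Runtime multiplier for the application tasks on a loaded node,
    assuming equal CPU shares among all CPU-intensive processes."""
    total = app_tasks_per_node + load_processes
    # Each task drops from a 1/n share to a 1/(n + load) share of the CPU.
    return total / app_tasks_per_node


sparse = worst_case_slowdown(1)     # one task per node: share halves
redundant = worst_case_slowdown(3)  # three tasks per node: load gets 25%

print(f"sparse: x{sparse:.2f}, redundant: x{redundant:.2f}")
```

With one task per node the external load takes half of the CPU, doubling the
runtime in the worst case; with three tasks per node it takes a quarter, which
bounds the lengthening at one third.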

The final, fifth set of tests is the combination of the two previous tests: the 
complete Dynamite environment is running, and the external load is applied. 
Dynamite clearly shows its value in case of the sparse decomposition, where, by 
migrating the application tasks away from the overloaded nodes, it manages to 
reduce the slowdown from 75% to 34%. The remaining slowdown is caused by: 

— the time for the monitor to notice that the load on the node has increased 
and to make the migration decision, 

— the cost of the migration itself is non-zero, 

— the master task, which is started directly from the shell, is not migrated; 
when the external load procedure was modified to skip the node with the 
master task, the slowdown decreased by a further 10%.

Turning to the redundant decomposition, it can be observed that the Dynamite
scheduler actually made matters worse, increasing the slowdown from 19%
to 25%. This result, although unwelcome, can easily be explained. The situation
was already rather bad even without the external load: not only were all the
nodes overloaded, they were also overloaded by the same factor (3). Therefore,
the migrator had virtually no room for improvement, and its attempts to migrate
the tasks actually worsened the situation. It can be argued, though, that the
migrator should have refrained from making any migrations in this case.




Fig. 5. Execution progress of Grail for sparse decomposition. Note that the
performance of plain PVM was measured without any load. With a simulated
background load it would have been only slightly better than the “DPVM +
load” performance.



Figure 5 presents the execution progress of Grail for sparse decomposition.
For standard PVM with no load applied this is a straight, steep line. The other
two lines denote DPVM with load applied, with and without the monitoring
subsystem running. Initially, they both progress much more slowly than PVM:
because the load is initially applied to the node with the master task, no
migrations take place. After approximately 600 seconds the load moves on to
another node. Subsequently, in the case with the monitoring subsystem running,
the migrator moves the application task out of the overloaded node, and the
progress improves significantly, coming close to that of standard PVM. In the
case with no monitoring subsystem running, there is no observable change at
this point. However, progress does improve between 1800 and 2400 seconds from
the start: that is when the idle node is overloaded. After 2400 seconds, the
node with the master task is overloaded again, so the performance deteriorates
in both DPVM cases.

4 Conclusions and Future Prospects

Concluding, our implementation of load balancing by task migration has been 
shown to be stable. The use of the Dynamite system results in a slight perfor- 
mance penalty in a well-balanced system, but significant performance gains can 
be obtained from task migration in an unbalanced system. Improvements can 
still be made in the scheduling. 

Dynamite aims to provide a complete integrated solution for dynamic load
balancing. A port to MPI is being implemented, in cooperation with the people
from Hector. Dynamite/DPVM can be obtained for academic, non-commercial
use through the author.



Abstract. This paper presents an extension of Dimemas to enable accurate
performance prediction of message passing applications with collective
communication primitives. The main contribution is a simple model for collective
communication operations that can be user-parameterized. The experiments
performed with a set of MPI benchmarks demonstrate the utility of the model.



1 Introduction 



Dimemas has been previously used for performance prediction of message passing
programs. In applications where communications are mainly point-to-point, it has
been demonstrated that it is a valuable tool. The next step is to prove its
utility for collective operations. It is necessary to develop a collective
communication model, as the point-to-point model based on latency and bandwidth
is insufficient.

The second goal is to prove the validity of the tool for point-to-point and collective 
communications when using communication intensive benchmarks. The results ob- 
tained in communication intensive benchmarks will demonstrate the correctness of the 
models, as they stress the communication. 

The paper is organized as follows: Section 2 reviews related work. Section 3
presents the Dimemas simulator and its point-to-point communication model. The
collective operation model is presented in Subsection 3.1. Section 4 reports the
experiments and results obtained. Finally, Section 5 presents some conclusions.



2 Related Work 

The LogP model describes a distributed memory multiprocessor in which processors
communicate by point-to-point messages. The model specifies the performance
characteristics of the interconnection network but does not describe its
structure. The model is based on the following parameters: L, latency or delay
to transmit a message that contains a word; o, overhead, the length of time that
a processor is engaged in the transmission or reception of a message; g, gap,
the minimum interval between two messages; and P, the number of processors.
These parameters are not equally important in all situations, and it is possible
to ignore one or more parameters depending on the application.

The APACHE system is a performance prediction tool for PVM programs. The
performance model it uses distinguishes between computation time and
communication time. All nodes are considered to be homogeneous. The approach is
divided into three phases. In the first phase, the compiler constructs a
call graph of the PVM program and creates an instrumented version. In the
dynamic analysis phase, the instrumented PVM program is executed, which
generates a set of equations. With this information and some parameters, the
prediction phase evaluates the equations and obtains a performance time
prediction for the program.

Another work presents a comparison between the performance of collective
communication primitives in different systems. The results of these experiments
are stored as a database, and are used for performance evaluation. The response
time of parallel programs is decomposed into a local computation part (LP) and a
communication part (CP). LP time is predicted by running a program that consists
of the local computation part of the program being studied. CP time is derived
from the performance database of the communication primitives.

A parallel simulator for performance evaluation of MPI programs, MPI-SIM, has
also been presented. This simulator uses direct execution to obtain the
computation time of programs. One of the drawbacks of this system is that host
and target processors should be similar to obtain accurate results.
Communication and I/O times are obtained by simulation. Calls to the MPI library
are replaced by calls to the library of the simulator. The presented results
show prediction errors between 5 and 20%.

An approach similar to Dimemas has also been presented. As in Dimemas, a trace
file obtained from a traced execution of the parallel program on a platform
different from the one to be evaluated is used as input to a simulator. Prior to
this traced execution, a static analysis step is performed. As a result of this
step, only one iteration of the communication patterns present in the loops
appears in the trace. The simulator is oriented to heterogeneous computing
environments, and is less accurate when used for performance prediction of
Massively Parallel Processor systems.



3 Dimemas 

Dimemas is a performance prediction tool for message passing programs. It is a
trace driven simulator that rebuilds the behavior of a parallel program from a
trace file and some parameters of the target architecture. The input trace file
characterizes the application. Initially it was developed with the aim of
studying the effects of time sharing message-passing programs among several
applications.

Besides summarized performance data, Dimemas can generate trace files that can
be viewed with Vampir and Paraver. Combining a trace driven simulator such
as Dimemas with a visualization tool helps in understanding the summarized
statistics. The user can analyze the sensitivity of a program to architectural
parameters without modifying the source code and running it again. In a similar
way, the effect on global application behavior of a potential improvement in a
routine can be observed.

Another significant goal in the design of Dimemas is that it should be possible
to obtain trace files in a "normal/typical" development environment, by which we
mean a single workstation or a time-shared, throughput-oriented parallel
machine. Dimemas allows trace files for performance analysis of message passing
programs to be obtained in such an environment, without needing a dedicated
parallel platform. From this point of view, Dimemas is a tool that avoids the
nasty effects of time sharing in trace-based visualization of parallel program
behavior. Dimemas input trace files for MPI programs are generated by
VAMPIRtrace, an instrumented MPI library and API.

Dimemas models the target architecture (the simulated machine) as a network of 
nodes. Each node is an SMP connected to the network with a set of links and buses. 
Every node is composed of one or more processors and local memory. 

The model of the target architecture is defined by several parameters: number of
nodes, number of processors per node, network bandwidth, communication latency,
number of input and output links, number of buses, etc. When rebuilding the
parallel program execution, Dimemas differentiates between point-to-point
communications and collective communications. Point-to-point communication time
is modeled as:

\[ T = L + \frac{S}{B} \qquad (1) \]

where L is the latency, S the size of the message and B the bandwidth. This
formula can be applied in a network without contention, with an unlimited number
of resources (buses and links). To model the bisection bandwidth of the system,
Dimemas considers a maximum number of available buses (defined by the user).
Also, to model the injection mechanism, a number of input and output links
between the nodes and the network can be defined. Half-duplex links can also be
specified.
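
As a minimal sketch of the contention-free case, the function below evaluates
the point-to-point time as the latency plus the message size divided by the
bandwidth, which is the form implied by the definitions of L, S and B above.
The function name and the unit conventions (seconds, bytes, bytes per second)
are assumptions made for this example; buses and links are not modeled.

```python
def point_to_point_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Contention-free message time: latency plus size over bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Example parameters (illustrative): 25 microseconds latency and
# 87.5 Mbytes/s bandwidth, taking 1 Mbyte as 10**6 bytes.
LATENCY = 25e-6
BANDWIDTH = 87.5e6

for size in (0, 1_000, 1_000_000):
    t = point_to_point_time(size, LATENCY, BANDWIDTH)
    print(f"{size:>9} bytes: {t * 1e6:.1f} microseconds")
```

For an empty message the time reduces to the latency alone; for large messages
the bandwidth term dominates.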



3.1 A Communication Model for Collective MPI Operations 



Many collective operations have two phases: a first one, where some information is 
collected (fan in) and a second one, where the result is distributed (fan out). Thus, for 
each collective operation, communication time can be evaluated as: 

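\[ T = \mathrm{FAN\_IN} + \mathrm{FAN\_OUT} \qquad (2) \]

FAN_IN time is calculated as follows:

\[ \mathrm{FAN\_IN} = \left( L + \frac{\mathrm{SIZE\_IN}}{B} \right) \times \mathrm{MODEL\_IN\_FACTOR} \qquad (3) \]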



Depending on the scalability model of the fan in phase, the parameter
MODEL_IN_FACTOR can take the following values:

Table 1. MODEL_IN_FACTOR possible values

MODEL_IN   MODEL_IN_FACTOR   Description
0          0                 Non existent phase
CT         1                 Constant time phase
LIN        P                 Linear time phase, P = number of processors
LOG        Nsteps            Logarithmic time phase



In the case of a logarithmic model, MODEL_IN_FACTOR is evaluated as the Nsteps
parameter. Nsteps is evaluated as follows: initially, to model a logarithmic
behavior, we will have $\lceil \log_2 P \rceil$ phases. The model also takes
network contention into account. In a tree-structured communication, several
communications are performed in parallel in each phase. If there are more
parallel communications than available buses, several steps will be required in
the phase. For example, if in one phase 8 communications are going to take place
and only 5 buses are available, we will need $\lceil 8/5 \rceil = 2$ steps. In
general we will need $\lceil C/b \rceil$ steps for each phase, where C is the
number of simultaneous communications in the phase and b the number of available
buses. Thus, if $steps_i$ is the number of steps needed in phase i, Nsteps can
be evaluated as follows:

\[ \mathit{Nsteps} = \sum_{i=1}^{\lceil \log_2 P \rceil} steps_i \qquad (4) \]

For FAN_OUT phases, the same formulas are applied, replacing SIZE_IN by
SIZE_OUT. SIZE_IN and SIZE_OUT can be:

Table 2. Options for SIZE_IN and SIZE_OUT

MAX     Maximum of the message sizes sent/received by root
MIN     Minimum of the message sizes sent/received by root
MEAN    Average of the message sizes sent and received by root
2*MAX   Twice the maximum of the message sizes sent/received by root
S+R     Sum of the sizes sent and received by root
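
The collective model can be sketched in a few lines. This is an illustration
under stated assumptions, not the Dimemas code: the text does not say how many
communications take place in each phase of the tree, so the helper
binomial_tree_phases assumes a binomial tree in which phase i carries
min(2^(i-1), P - 2^(i-1)) simultaneous communications. All function names are
hypothetical; model names follow the tables above.

```python
def binomial_tree_phases(processors):
    """Assumed number of simultaneous communications in each phase of a
    binomial tree over `processors` nodes (illustrative assumption)."""
    phases = (processors - 1).bit_length()  # equals ceil(log2(P)) for P >= 1
    return [min(2 ** i, processors - 2 ** i) for i in range(phases)]


def nsteps(processors, buses):
    """Sum over phases of ceil(C / buses), C = communications per phase."""
    return sum(-(-c // buses) for c in binomial_tree_phases(processors))


def model_factor(model, processors, buses):
    """Scaling factor for a fan in or fan out phase."""
    if model == "0":
        return 0              # non existent phase
    if model == "CT":
        return 1              # constant time phase
    if model == "LIN":
        return processors     # linear time phase
    if model == "LOG":
        return nsteps(processors, buses)
    raise ValueError(f"unknown model: {model}")


def phase_time(model, size_bytes, latency_s, bandwidth, processors, buses):
    """(latency + size / bandwidth) scaled by the model factor."""
    factor = model_factor(model, processors, buses)
    return (latency_s + size_bytes / bandwidth) * factor


def collective_time(model_in, size_in, model_out, size_out,
                    latency_s, bandwidth, processors, buses):
    """Total collective time: fan in phase plus fan out phase."""
    return (phase_time(model_in, size_in, latency_s, bandwidth,
                       processors, buses)
            + phase_time(model_out, size_out, latency_s, bandwidth,
                         processors, buses))


# With 16 processors and 5 buses the assumed phases carry 1, 2, 4 and 8
# communications, so the last phase needs two steps.
steps_16_5 = nsteps(16, 5)
```

Passing the per-phase communication counts through a separate helper keeps the
tree shape, which is the assumed part, apart from the bus contention rule,
which comes from the text.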



4 Model Validation 

To validate the communication model presented in the previous section, several
experiments were performed. The experiments were done on a 64-processor SGI
Origin at CEPBA-UPC with a set of micro-benchmarks that intensively stress some
of the MPI communication primitives. Each benchmark was run with dedicated
resources and the dedicated elapsed time (DET) was measured. Also, an input
trace file for Dimemas was obtained for each benchmark by running it on a loaded
system.

The method we follow for the validation of the model is based on the execution
of the simulator with different parameters. For all experiments we used ST-ORM,
a tool for stochastic optimization, to help us in the specification, execution
and analysis of the different experiments.

Fig. 1. Top: range of bandwidths (Mbytes/s) that lead to a predicted time within
10% of the dedicated elapsed time; bottom: range of latencies (µs) that lead to
a predicted time within 10% of the dedicated elapsed time



4.1 General Parameters 

A configuration file is used to model the behavior of collective MPI primitives. As we 
do not know implementation details of the used MPI library, a first set of experiments 
was performed to fix this file. We found out that a reasonable model would be: 



Table 3. Parameters used to model collective operations

Id op   MODEL_IN   SIZE_IN   MODEL_OUT   SIZE_OUT   Collective operation
0       LIN        MAX       LIN         MAX        /* MPI_Barrier */
1       LOG        MAX       0           MAX        /* MPI_Bcast */
2       LOG        MEAN      0           MAX        /* MPI_Gather */
3       LOG        MEAN      0           MAX        /* MPI_Gatherv */
4       0          MAX       LOG         MEAN       /* MPI_Scatter */
5       0          MAX       LOG         MEAN       /* MPI_Scatterv */
6       LOG        MEAN      LOG         MEAN       /* MPI_Allgather */
7       LOG        MEAN      LOG         MEAN       /* MPI_Allgatherv */
8       LOG        MEAN      LOG         MAX        /* MPI_Alltoall */
9       LOG        MEAN      LOG         MAX        /* MPI_Alltoallv */
10      LOG        2*MAX     0           MAX        /* MPI_Reduce */
11      LOG        2*MAX     LOG         MAX        /* MPI_Allreduce */
12      LOG        2*MAX     LOG         MIN        /* MPI_Reduce_scatter */
13      LOG        MAX       LOG         MAX        /* MPI_Scan */



Of all the collective communication primitives, only MPI_Barrier shows linear
behavior. This benchmark was executed with varying numbers of processors. The
results obtained show that the execution time grows linearly with the number of
processors. This can be explained by the fact that the MPI implementation used
is based on shared memory, so all processors have to update a fixed memory
position sequentially.



Fig. 2. Influence of the number of buses on the predicted time (in seconds) for
the SendRecv (upper-left), Exchange (upper-right), Reduce_scatter (lower-left)
and Allgather (lower-right) benchmarks (BW=87.5, L=25, 1 link HD)



This preliminary part of the experiments has also shown that setting the number
of links of each node to one half-duplex link models the system better than one
full-duplex link. This is also due to the shared memory MPI implementation,
where each processor can only be involved in one transfer at a time.



4.2 System Parameters Optimization 

A first set of experiments was performed to evaluate the influence of the
latency and bandwidth. For each benchmark a set of simulations was performed
(between 70 and 110), varying the latency and bandwidth parameters.

For these simulations, the number of buses was set to 10. This number
approximates 0.6 P, where P is the number of processors (P = 16 in our case).
The value 0.6 P is an approximation to the maximum bandwidth of a crossbar
network. The number of links was set to one half-duplex link. For each
simulation a predicted time (PT) was obtained. Figure 1 shows the range of
bandwidths (in Mbytes/second) such that PT has less than 10% error with respect
to DET, and the range of latencies (in microseconds) that satisfies the same
condition. From these results it can be concluded that, for example, a bandwidth
of 80 Mbytes/s and a latency of 25 µs can be used for performance prediction.

A second set of experiments was performed to evaluate the effects of the number
of buses. In this case, we set the bandwidth to 87.5 Mbytes/s and the latency to
25 µs, and modeled the connection of the nodes to the network with one
half-duplex link, while the number of buses was the parameter that varied. The
results obtained for PingPing and PingPong show that these benchmarks are not
influenced by network contention: the predicted time does not change
significantly with the number of buses.

Figure 2 (top) shows the results of this experiment for the Exchange and
SendRecv benchmarks. We can see that in those cases the predicted time varies
greatly depending on the number of defined buses. As the measured DET for
SendRecv is 18.2 s and for Exchange is 37.4 s, we can conclude that any value
between 7 and 16 for the number of buses will model the network contention
correctly.




Fig. 3. Error (%) in prediction when simulating with BW = 87.5 Mbytes/s, L = 25 µs, 1 HD link, 16 buses 

For all collective operations, even MPI_Barrier, the results of the experiment were 
similar to those obtained for the PingPing and PingPong examples, with gaps in most 
cases when the number of buses is a power of 2. Figure 2 (bottom) shows the 
results obtained for the Allgather and Reduce_scatter benchmarks. Here, too, the defined 
number of buses influences the predicted time. As the DET for the Allgather benchmark 
is 385.1 secs and for Reduce_scatter is 44.0 secs, we can conclude that any value 
between 8 and 16 can be used to model bus contention for these cases (the same result 
was obtained for the remaining collective operations). 

Given that the bandwidth of the machine we used is wide enough, we can model 
bus contention with high values. Probably, if a computer with a lower bandwidth had 
been used, we would have to model bus contention with lower values. 

Finally, to validate the correctness of the parameters obtained in the previous 
experiments, a third set of experiments was performed. For this series of experiments, 
we ran Dimemas for all NAS benchmarks (classes A and B, number of tasks 8-9, 16, 
25-32). Figure 3 shows the percentage of error obtained. Most of the benchmarks are 
predicted with less than 10% error. The point that falls outside the graph has a value 
of 150% error. This error and those over 10% correspond to very short executions (less 
than five seconds), so the absolute difference between execution and prediction is negligible. 



5 Conclusions 

In this paper we have presented an approach that takes into account the difference 
between collective and non-collective message passing primitives. A simple but accu- 
rate formulation for the prediction of communication time invested by collective op- 
erations has been defined. This formulation has been included in Dimemas. 

The experiments, carried out using an MPI implementation and communication- 
intensive benchmarks, show the validity of Dimemas for performance prediction of 
message passing programs. 

Abstract. The PVM parallel programming model provides a convenient 
methodology for creating dynamic master/worker applications. In this paper, we 
introduce the benefits of using the KappaPi tool for the automatic analysis of 
master/worker applications: first, the automatic detection of the 
master/worker paradigm in the application; and second, the performance 
analysis of the application, focusing on the performance bottlenecks and the 
limitations of this master/worker collaboration. 



1. Introduction 

The main reason for designing and implementing a parallel application is to benefit 
from the potential high performance resources of a parallel system [1]. That is to say, 
one of the main objectives of the application is to reach a satisfactory level of 
performance in the execution. The hard task of building an application on top of 
libraries like PVM [2] or MPI [3] should be compensated by high performance 
values, such as fast execution or good scalability, if not a 
combination of both. 

Unfortunately, obtaining a high degree of performance in an application becomes a 
very hard task. It is necessary to consider many different sources of information like 
the behavior of the programming model used to select the most adequate primitives 
for the program, or the actual details of the parallel machine to understand the effect 
of using certain primitives in the processors and in the communication links. 

These requirements, although taken into account in the programming stages of the 
application, usually require a new stage of performance analysis when the results 
obtained are far from the desired performance values. 

To help in this process of analysing the performance of an application, many tools 
have been presented. Software tools like Paradyn [4], AIMS [5] and P3T [6] have 
introduced techniques that automate this process of analysis, providing 
information such as what the most important performance bottlenecks in the 
execution are and where they are located in the code of the application. 

Kappa-Pi (Knowledge-based Analyser of Parallel Programs And Performance 
Improver) was conceived as part of this automatic performance analysis effort [7]. The Kpi 
tool has been designed for the automatic performance analysis of message-passing 
parallel programs. Its purpose is to give users some hints about the actual quality of 
the performance of their applications, together with some suggestions about what 
changes can be applied to the application to improve the performance. 

The Kpi tool analyses the execution of parallel applications represented by trace files. 
These trace files are analysed to classify the inefficiency of the application execution. 
The intervals with the most important inefficiencies are then analysed in detail, 
looking for the causes of the problem. At the same time, the Kpi tool tries to identify 
any source code reference related to the problem found, in order to build an explanation 
for the user [8]. 

Trace files contain all the actions that happened during the application execution. If traces 
are the only source of information presented to the user, the programmer has to 
understand the low-level information (such as the communication messages sent and 
received, together with the accumulated times and lengths) and, from there, abstract 
the important problems of the application. For this reason, Kpi has an internal rule- 
based system that identifies common structures in the execution that are closer to 
the programmer's view. The objective of the rule-based system is to relate the 
recommendations to the actual programming structures used in the application. 

A very common structure in PVM applications is the master/worker paradigm. The 
use of the dynamic process creation and the straightforward use of communication 
primitives (non-blocking send and blocking receive) together with the use of two 
clear roles in the computation, master and worker, define a common application 
structure, rather easy to program in PVM. 

Kpi tool will use its rule-based system to recognise master/worker PVM applica- 
tions. This recognition will allow Kpi tool to analyse the specific performance of such 
collaborations, suggesting possible changes that will improve the efficiency. 
Therefore, all applications that fall into this master/worker collaboration paradigm 
will be specifically studied to obtain suggestions that will specially address the 
master/worker design of the application. 

In section 2 we are going to introduce the principles of the rule-based system used 
by Kpi to classify the execution performance of a parallel application. Section 3 will 
explain what are the special characteristics of a PVM master/worker application. In 
section 4, we give an example of such master/worker applications that will be 
analysed in detail. Finally, section 5 will present the conclusions of this work. 



2. Rule-Based Performance Analysis System 

Kappa-Pi's initial source of information is the trace file obtained from an execution of 
the application. First of all, the trace events are collected and analysed in order to 
build a summary of the efficiency along the execution interval under study. This 
summary is based on the simple accumulation of processor utilization versus idle and 
overhead time. The tool keeps a table with those execution intervals with the lowest 
efficiency values (higher number of idle processors). These intervals are saved 
according to the processes involved, so that any new inefficiency found for the same 
processes is accumulated. 

At the end of this initial analysis we have an efficiency index for the application 
that gives an idea of the quality of the execution. On the other hand, we also have a 
final table of low efficiency intervals that allows us to start analyzing why the 
application does not reach better performance values. 
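
As an illustration of this first stage, the following Python sketch accumulates busy versus idle processor time and keeps a table of idle time keyed by the processes involved. The interval format, the function name `summarize` and the sample `trace` are assumptions made for the example; the actual Kpi trace format is not described here.

```python
from collections import defaultdict


def summarize(intervals, num_processors):
    """Return (efficiency index, idle-time table keyed by the processes involved).

    Each interval is (start, end, busy_processors, processes), where `processes`
    names the processes involved in the idle or blocked part of the interval.
    """
    busy_time = 0.0
    total_time = 0.0
    idle_table = defaultdict(float)
    for start, end, busy, processes in intervals:
        length = end - start
        busy_time += busy * length
        total_time += num_processors * length
        idle = (num_processors - busy) * length
        if idle > 0:
            # Inefficiencies found for the same processes are accumulated.
            idle_table[frozenset(processes)] += idle
    efficiency = busy_time / total_time if total_time else 0.0
    return efficiency, dict(idle_table)


# Hypothetical execution on three processors (times in seconds).
trace = [
    (0.0, 2.0, 3, []),                      # every processor busy
    (2.0, 6.0, 1, ["master", "worker1"]),   # two processors idle for 4 s
    (6.0, 7.0, 2, ["master", "worker2"]),
    (7.0, 9.0, 1, ["master", "worker1"]),
]
efficiency, table = summarize(trace, num_processors=3)
worst = max(table, key=table.get)
print(round(efficiency, 3), sorted(worst))
```

The entry with the largest accumulated idle time (`worst`) is the natural starting point for the detailed analysis.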



2.1. Automatic Classification of Performance Problems 

The next stage in the Kpi analysis is the classification of the most important 
inefficiencies selected in the previous stage. The selected trace file intervals 
contain the location of the execution inefficiencies, so their further analysis 
provides more insight into the behavior of the application. 

In order to know which kind of behavior must be analysed, the Kpi tool classifies the 
selected intervals with the use of a rule-based knowledge system. The table of 
inefficiency intervals is sorted by accumulated wasted time, and the longest 
accumulated intervals are analysed in detail. Kpi takes the trace events as input 
and applies the set of behavior rules, producing a list of deduced facts. The rules 
are then applied again to the newly deduced facts until they do not deduce any new fact. 
The higher order facts (deduced at the end of the process) allow the creation of an 
explanation of the behavior found to the user. 
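
The "apply the rules until nothing new is deduced" loop is a standard forward-chaining scheme. The Python sketch below shows the idea with rules written as (premises, conclusion) pairs; the rule encoding and the fact strings are assumptions for illustration and do not reproduce the actual Kpi rule base.

```python
def select_intervals(table, count):
    """Return the `count` entries with the largest accumulated wasted time."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)[:count]


def forward_chain(initial_facts, rules):
    """Apply the rules until no new fact is deduced; return all known facts.

    `rules` is a list of (premises, conclusion) pairs: when every premise is a
    known fact, the conclusion becomes a fact as well.
    """
    facts = set(initial_facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)
                changed = True
    return facts


# Hypothetical rules; the second pass is needed because the first rule
# depends on a fact that only the second rule can deduce.
rules = [
    (("comm A->B", "B blocked in recv"), "B depends on A"),
    (("send A->B", "recv B<-A"), "comm A->B"),
]
facts = forward_chain({"send A->B", "recv B<-A", "B blocked in recv"}, rules)
top = select_intervals({"a": 5, "b": 9, "c": 1}, 2)
print(sorted(facts), top)
```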

The creation of this description depends very much on the nature of the problem 
found, but in the majority of cases more specific information must be collected 
to complete the analysis. In some cases, it is necessary to access the 
source code of the application and to look for a specific primitive sequence or data 
reference. Therefore, the last stage of the analysis is to call "quick 
parsers" that look for very specific source information to complete the performance 
analysis description. 

This first analysis of the application execution data yields an identification of the 
most general behavior characteristic of the program. In the example 
presented in this work, this is a master/worker PVM application. 

The second step of the analysis is to use this information about the behavior of the 
program to analyse the performance of this particular application. Once the program 
has been identified as being of a previously known type, it can be analysed in detail 
to find how it can be optimized for the machine in use. 



3. Master/Worker PVM Applications Characteristics 

Master/worker applications allow the easy distribution of computational load. 
Basically, master processes create a data item that must be computed. When this item 
is created, it is sent to a worker that carries out some computations with it. Optionally, 
a result is brought back to the master when the computation finishes. 

Normally, workers implement the heavy computation. There are usually several of 
them, calculating in parallel the data items sent by the master. The work carried out by 
the master usually consists of lighter-weight calculations (data item generation) and 
the gathering of final data values. Therefore, the number of running master instances is 
usually not very high. 

Many issues concerning the internal behavior of the master/worker collaboration 
can be solved in different ways. For example, programmers must choose a 
synchronization mechanism between the master and the workers, decide the 
number of workers for each master, etc. 

PVM provides a straightforward way of programming a parallel application with 
this paradigm. Typically, a master process spawns the number of workers it is going 
to need. Then, it proceeds to calculate and send the data items until there are no 
available workers (this can be verified using an array of available worker 
identifiers). At this point, the master waits for any of the workers to finish before 
sending the next data item generated. Therefore, it must keep track of which workers 
are idle and which are working. 
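
The dispatch logic described above can be sketched without PVM as a small simulation. A real master would spawn workers with `pvm_spawn` and exchange data with `pvm_send` and `pvm_recv`; here the workers are simulated, and `run_master`, the `cost` function and the sample items are hypothetical names introduced only for this example.

```python
import heapq


def run_master(items, num_workers, cost):
    """Simulate the master loop: dispatch while a worker is idle, else wait.

    `cost(item)` is the (hypothetical) time a worker needs for an item.
    Returns the assignment list [(item, worker)] and the finish time.
    """
    idle = list(range(num_workers))      # identifiers of available workers
    running = []                         # heap of (finish_time, worker)
    now = 0.0
    assignment = []
    for item in items:
        if not idle:
            # No worker available: wait for the first one to finish.
            now, worker = heapq.heappop(running)
            idle.append(worker)
        worker = idle.pop(0)
        heapq.heappush(running, (now + cost(item), worker))
        assignment.append((item, worker))
    while running:                       # collect the remaining results
        now, _ = heapq.heappop(running)
    return assignment, now


# Each item is used directly as its own computation time (in seconds).
assignment, finish = run_master([4, 1, 1, 2], num_workers=2, cost=lambda x: x)
print(assignment, finish)
```

In this run the second worker processes three short items while the first one is busy with the long item, which is the load-balancing effect the paradigm relies on.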



4. Master/Worker Identification and Performance Analysis 

To demonstrate the possibilities of Kpi, we have selected a simple application, called 
Xfire, to be used as an example to show: 

How the application is classified as a master/worker collaboration. 

How the performance problems of this application are found and analysed to 
derive a suggestion to the programmer of the application. 

The Forest Fire Propagation application (Xfire)[9] is a PVM message passing 
implementation of the “fireline propagation” simulation based on the Andre-Viegas 
model [10] developed for use in any network of workstations. It follows a master- 
worker paradigm where there is a single master process that generates a partition of 
the fireline and distributes it to the workers. These workers are in charge of the local 
propagation of the fire itself and have to communicate the position of the fireline 
limits back to the master. In the next step, the master collects the new fireline 
positions and applies the general model with the new local propagation results to 
produce a new partition of the general fireline (which must be sent to workers to 
calculate the new propagation interval again). 



4.1. Master / Worker Detection 

Considering a three-computer cluster, the fireline is divided into two sections, so two 
workers are in charge of the local propagation model. Once the instrumented 
binaries of the application code are generated, the application is executed to get a 
trace file that serves as input to the analysis. 

The first trace segment is then analysed looking for processor idle or blocking 
intervals in the execution. A typical processor idle interval is the time spent waiting for a 
message to arrive when calling a blocking receive. All these intervals are identified by 
the ids of the processes involved and a label that describes the PVM primitive that 
caused the waiting time. For instance, table 1 lists the most important 
efficiency problems found in the Xfire execution. The table shows the operation that 
caused the inefficiency and the accumulated processor idle/blocking time (in 
microseconds). The worst problem found is the accumulated time that the master 
process (firemaster) was waiting for fireslave1 on machine1 to answer back with the 
local fireline calculation. This inefficiency is the first of those listed in table 1. 



Inefficiency caused by:                                                  Accumulated time (µsecs) 

Communication from fireslave1 at machine1 to firemaster at machine3      102.087.981 
Communication from fireslave2 at machine2 to firemaster at machine3      5S.900.S71 
Communication from firemaster at machine3 to fireslave1 at machine1      18.8SS.645 
Communication from firemaster at machine3 to fireslave2 at machine2      14.925.S44 

Table 1. Accumulated time for the most important inefficiencies found in the trace file and the 
event that produced them. 



Once the most important inefficiencies have been found (table 1), it 
is time to identify them. For this purpose, we use a rule-based knowledge system 
that, applied to the events produced while the inefficiency was at its highest values, 
deduces some behavior characteristics that are useful to analyse the application. 

Figure 1 shows an example of the sequence of facts deduced to detect a 
master/worker collaboration. The deduced facts in the figure illustrate how the 
rule-based system deduces more general facts at each step. At the beginning, a rather 
low-level "communication between firemaster and fireslave1" is deduced from the 
send-receive event pairs in the trace file. From the deduced facts, higher level 
constructions can be deduced, for example the "dependency of fireslave1 on 
firemaster", which reflects the detection of a communication between both processes and 
a blocking receive at process fireslave1. Similar deductions can be applied 
to the other worker processes. The deduction process is thus based on the existence of 
previous facts and the application of operators like “and” and “or”. 

The last fact deduced is the "master-worker" relationship between firemaster 
and fireslave1. This rule depends on the detection of: 

The repeated blocking of the presumed worker process waiting for the master 
(called dependency in figure 1). 

The repeated intercommunication between the master and the presumed 
worker (called relationship in figure 1). 

A detailed description of the rules and their meaning can be found in [8]. When the 
facts found lead to the detection of a master/worker application, the analysis 
concentrates on finding the performance limits of this collaboration. 

Trace file 

    time1  receive at fireslave1 from firemaster at line1 
    time2  send from firemaster to fireslave1 at line2 
    time3  receive at firemaster from fireslave1 at line3 
    time4  send from fireslave1 to firemaster at line4 

Deduced facts from events 

    communication from firemaster to fireslave1 
    blocked fireslave1 (time1 < time2) 
    communication from fireslave1 to firemaster 

Deduced facts from previous facts 

    dependency firemaster fireslave1 
    relationship firemaster fireslave1 

    master/worker firemaster, fireslave1 

Fig. 1. Deduction steps necessary to identify master/worker collaborations. The first step is 
based on the trace file events, while the rest use the previously deduced facts. 
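
The deduction steps of Fig. 1 can be approximated by the following Python sketch. It is a simplification under stated assumptions: events are tuples `(time, kind, process, peer)`, only one message per direction is considered, and a single blocked receive is taken as a dependency, whereas the rules described in the text require the blocking and the intercommunication to be repeated. The timestamps in `events` are invented for the example.

```python
def deduce(events):
    """Deduce Fig. 1 style facts from (time, kind, process, peer) trace events.

    `kind` is "send" or "recv"; `time` is the moment the call was issued.
    """
    facts = set()
    sends = [(t, p, q) for t, kind, p, q in events if kind == "send"]
    recvs = [(t, p, q) for t, kind, p, q in events if kind == "recv"]
    # Step 1: facts deduced directly from the trace events.
    for t_send, src, dst in sends:
        for t_recv, receiver, sender in recvs:
            if receiver == dst and sender == src:
                facts.add(("communication", src, dst))
                if t_recv < t_send:
                    # The receive was issued before the send: receiver blocked.
                    facts.add(("blocked", dst, src))
    # Step 2: facts deduced from previous facts.
    for fact in list(facts):
        if fact[0] == "blocked":
            _, worker, master = fact
            facts.add(("dependency", master, worker))
        if fact[0] == "communication":
            _, src, dst = fact
            if ("communication", dst, src) in facts:
                facts.add(("relationship", src, dst))
    # Step 3: master/worker needs both a dependency and a relationship.
    for fact in list(facts):
        if fact[0] == "dependency" and ("relationship",) + fact[1:] in facts:
            facts.add(("master/worker",) + fact[1:])
    return facts


# Hypothetical timestamps: the worker posts its receive before the master sends.
events = [
    (1, "recv", "fireslave1", "firemaster"),
    (2, "send", "firemaster", "fireslave1"),
    (7, "send", "fireslave1", "firemaster"),
    (9, "recv", "firemaster", "fireslave1"),
]
facts = deduce(events)
print(("master/worker", "firemaster", "fireslave1") in facts)
```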



4.2. Master/Worker Analysis 

Once this kind of master/worker collaboration is found, the Kpi tool uses this 
paradigm identification to evaluate the performance of the current master/worker 
configuration. Kpi looks at the accumulated waiting times and estimates whether 
it is possible to reduce them in later executions. 

To maximise performance, Kpi estimates the ideal number of workers by evaluating 
the ratio between the data generation and the computation rates in a master/worker 
application, considering the characteristics of the processors and intercommunication links. 

For each master/worker iteration, Kpi can measure the wasted time at the master. 
This wasted time is usually spent waiting for the workers to finish their computation, 
so that they can receive more data. 

Wasted time = max (measured computation time per worker) + communication costs 

where the communication costs comprise the actual time to send the initial message 
from the master to the worker and the final message back to the master again. 
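
As a worked example of the formula, the following Python sketch evaluates the wasted time for two iterations and accumulates it. The function name and all the timing values are illustrative assumptions, not measurements from the Xfire runs.

```python
def wasted_time(computation_times, send_cost, reply_cost):
    """Master waiting time in one iteration, following the formula in the text."""
    return max(computation_times) + send_cost + reply_cost


# Hypothetical iterations: (computation time per worker, send cost, reply cost),
# all in seconds.
iterations = [
    ([4.0, 6.5], 0.25, 0.5),
    ([5.0, 5.5], 0.25, 0.5),
]
per_iteration = [wasted_time(*it) for it in iterations]
total = sum(per_iteration)
print(per_iteration, total)
```

The slowest worker dominates each iteration, which is why balancing the load among workers (or adding workers) reduces the accumulated waiting time at the master.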

To build a suggestion to the programmer, Kpi estimates the load of the calculation 
assigned to each worker (assuming that they all receive a similar amount of work). 
From there, Kpi calculates the possible benefits of adding new workers (considering 
the target processor’s speed and communication latencies). This process ends 
when Kpi finds the estimated number of workers that minimises the waiting times. 

Figure 2 shows the feedback given to the users of Kpi when the performance 
analysis is finished. The program window is split into three main areas. On the left hand 
side of the screen [statistics] there is a general list of efficiency values per processor. 
At the bottom of the screen [recommendations] the user can read the performance 
suggestion given by Kpi. On the right hand side of the screen [source view], the user 
can switch between a graphical representation of the execution (Gantt chart) and a 
view of the source code, with some highlighted critical lines that could be modified to 
improve the performance of the application. In the recommendations area, the tool 
suggests modifying the number of workers in the application, proposing three as the 
best number of workers. It therefore points at the source code line where the workers 
are spawned. This is the place to create a new worker for the application. 




Fig. 2. Final view of the analysis of the Xfire application 



Inefficiency caused by                                                   Accumulated time (microsecs) 

communication from firemaster at machine3 to fireslave1 at machine1      51.773.801 
communication from firemaster at machine3 to fireslaveS at machines      32.101.666 
communication from firemaster at machine3 to fireslaveS at machined      23.906.561 

Table 2. Waiting results with three workers 

Following the suggestion shown in figure 2 (suggestion window), we executed the 
Xfire application with a new worker. As seen in table 2, the accumulated waiting 
time for the most important performance problems is reduced with respect to the previous 
execution, especially the time wasted in communications arriving at the master 
(blocking receives). 

Additionally, the total execution time dropped by more than half, from 383 seconds 
with two workers to 144 seconds with three workers. 



5. Conclusions 

Kpi is capable of automatically detecting a master/worker collaboration in a general 
PVM application with the use of its rule-based system. Furthermore, the performance 
of such an application is analysed with the objective of finding its 
limits on the machine in use. This process has been shown using a forest fire 
propagation simulator. 

This process of finding high level programming structures depends very much on 
the identification of the behavior of a certain code. As there are many different ways 
to program the same application, this identification process will be much more 
successful when there are some recognisable operations in the programming of the 
application. We think that the use of pre-defined templates for programming 
certain typical structures could be very helpful for the process of automatic detection 
and improvement of the application performance. 

Abstract. In this paper, we focus on performance of point-to-point 
communication and collective operations with multiple processes per 
node over an SCI network. By careful matching of sending and receiving 
data, performance close to network peak is achieved for point-to-point 
communication and a variety of collective operations. 



1 Introduction 

A cluster is a machine that consists of a number of workstations (often low- 
cost PCs) interconnected with one or more network adapters to act as a single 
computing resource. Small SMPs (symmetrical multiprocessors) currently have 
better price-performance than single processor workstations, and are hence 
attractive as the workhorse for clusters. Solving problems on parallel machines 
introduces data exchange. The aggregated data volume exchanged for most 
applications grows with the number of processes. To build scalable clusters, the 
capacity of the interconnecting network must therefore scale with the number of 
workstations in the cluster. 

SCI (Scalable Coherent Interface) is a standardized high-speed inter- 
connect based on shared memory, with the adapters connected in closed rings. 
SCI’s hardware error-checking mechanisms enable reliable data communication 
with minimal software intervention, and hence very low latency communica- 
tion. Dolphin’s SCI to PCI bridge family [2] has hardware support for routing 
traffic between multiple SCI rings connected to the same adapter. Using a multi- 
dimensional mesh as network topology enables building of large clusters with 
scalable network performance up to large configurations. 

MPI (Message Passing Interface) [3] is a well-established communication 
standard. The collective communication primitives in MPI cover most com- 
mon global data movement and synchronization primitives. ScaMPI is Scali’s 
thread-hot & -safe high performance implementation of MPI. ScaMPI currently 
runs over local and SCI shared memory on Linux and Solaris for x86-, IA-64-, 
Alpha- and SPARC-based workstations. ScaMPI over SCI has a latency of 6.0 µs 
and a peak bandwidth of 90 MByte/s, and 1.7 µs and 320 MByte/s SMP-internally (dual 
733 MHz Intel Pentium IIIs on an i840 motherboard). 

There are two approaches to utilizing a multiprocessor in a message passing 
context. One approach is using threads or a parallelizing compiler, e.g. Posix 
threads or OpenMP. A more straightforward approach is to use a one-to-one 
mapping between processes and processors. Being thread-hot & -safe, ScaMPI 
can be used with both approaches. This paper focuses on issues regarding mul- 
tiple MPI processes per SMP. The paper presents the work we have done with 
ScaMPI to make basic MPI send and receive SMP aware (section 3). SMP 
aware algorithms for MPI collective operations are then introduced (section 4) 
and performance on a 16-SMP cluster is presented (sections 4.1, 4.2 & 4.5). 

1.1 Used Hardware and Software 

The benchmarked cluster consisted of 16 PCs (Intel 440BX) interconnected with 
Dolphin 32 bit/33 MHz PCI-SCI cards (D311/D312) connected in a 4x4 mesh. 
Each SMP was equipped with dual 450 MHz Pentium III (Katmai) and 256 
MByte memory, and ran Linux 2.2.14-6smp (patchlevel: #1). 

The point-to-point test was performed using a modified MPICH perftest, 
and the collective test was performed using Pallas MPI Benchmarks. All 
tests were compiled with egcs-1.1.2. 

2 Synchronizing Multiple Senders on the Same SMP 

SCI shared memory is mapped directly into user space and the operating system 
is only used for connection setup, service and error handling. Since no OS 
calls are made during normal communication, low latency message passing can 
be achieved. Data from PCI to SCI are internally buffered on the adapter in one 
of eight 64 byte streams to form longer SCI packets. The streams are direct 
mapped with respect to the address of their destination. When the highest byte 
in a stream is written, its content is flushed to the SCI network, and stored in 
remote memory. A stream is also flushed if the outgoing datum and the data already 
in the stream are from different 64-byte sections, or through control registers (to 
force consistency). Processes writing concurrently to remote SCI memory will 
therefore cause massive stream usage conflicts, and poor performance. 
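
The effect of concurrent writers on the stream buffers can be illustrated with a toy model in Python. The mapping `stream = (address // 64) % 8`, the 8-byte write size and the two destination addresses are assumptions chosen to provoke a conflict; the actual adapter mapping is not specified in the text.

```python
STREAMS = 8
STREAM_BYTES = 64


def count_flushes(writes, word=8):
    """Count (full, early) stream flushes for a sequence of write addresses.

    Assumed mapping (not from the paper): stream = (address // 64) % 8.
    """
    current = [None] * STREAMS      # 64-byte section currently held per stream
    full = early = 0
    for address in writes:
        section = address // STREAM_BYTES
        stream = section % STREAMS
        if current[stream] is not None and current[stream] != section:
            early += 1              # data from another section: flush early
        current[stream] = section
        if address % STREAM_BYTES + word == STREAM_BYTES:
            full += 1               # highest byte written: flush the stream
            current[stream] = None
    return full, early


# Two processes each write one 64-byte block; both blocks map to stream 0.
a = list(range(0, 64, 8))
b = list(range(512, 576, 8))
serialized = count_flushes(a + b)
interleaved = count_flushes([addr for pair in zip(a, b) for addr in pair])
print(serialized, interleaved)
```

In this model the serialized writers produce two full 64-byte packets, while the interleaved writers trigger an early flush on almost every write, so the same data leaves the adapter in many short packets.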

To provide processes with exclusive access to the adapter, a global, pre- 
initialized mutex is offered to all processes. The standard Linux mutex imple- 
mentations use 10-30 µs to switch between processes, which is high compared to 
the 6 µs latency of ScaMPI. Scali’s own lightweight mutex, using a spinlock, is 
therefore used. Since this mutex is not registered by the OS, there is no detec- 
tion if a process terminates while holding the mutex. To avoid deadlock, the SCI 
driver unconditionally unlocks the mutex when SCI memory is released (which 
all processes do on termination). Since processes use active waiting, no context 
switch takes place when the mutex is passed (timing below 1 µs). 

Raw (unchecked) SCI traffic has a latency below 3 µs and a unidirectional band- 
width of 88 MByte/s (bidirectional 97 MByte/s). At MPI level the latency is 
6.0 µs and the peak bandwidth is 86 MByte/s (bidirectional 90 MByte/s). If two 
serialized processes on one SMP send to two processes on another SMP, the ag- 
gregated bandwidth increases to 90 MByte/s, while it drops to 29 MByte/s if 
the senders are concurrent. 

Fig. 1. Aggregated bandwidth for ping-pong communication 

Figure 1 shows the aggregated bisection network throughput between two 
SMPs sending messages in a ping-pong pattern (mpptest -bisect -roundtrip). 
With one process per SMP the throughput for long messages is 84 MByte/s. 
Throughput increases to 92 MByte/s for two processes with serialized access 
and drops to approx. 55 MByte/s with concurrent access. As described earlier, 
the concurrent behavior is unpredictable, hence the ragged performance curve. 



3 Handling Immediate Communication 

Immediate (non-blocking) communication can be implemented at several levels 
of concurrency: 

— Generating separate threads for each call (very resource demanding). 

— One separate thread to handle immediate send and receive requests. 

— Handling all requests within the context of the application thread. 

A dual threaded approach is easy to implement using separate send and receive 
queues synchronized with semaphores. By combining the functionality of the 
two operations into one handler, using polling of the request queues and state 
information, the execution becomes less resource demanding. This approach is one 
of the possibilities for immediate handling in ScaMPI (threaded). 

A common request, when running on a large machine, is to get acceptable 
performance with a one-to-one mapping between processes and processors. The 
default immediate handling in ScaMPI is therefore more relaxed. The immediate 
communication requests are queued and handled as soon as possible: when the 
operation is initiated, or when another MPI call is made. Handling is eventually 
forced when MPI_Wait*() is called. 
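
The relaxed handling can be pictured as a progress engine that only advances inside library calls. The Python sketch below is a toy model of that idea, not ScaMPI's implementation: the class `RelaxedEngine`, its methods and the `transfer` callback (which stands in for "the network can take this request now") are all hypothetical.

```python
from collections import deque


class RelaxedEngine:
    """Toy progress engine: requests advance only inside library calls."""

    def __init__(self, transfer):
        self.pending = deque()
        self.transfer = transfer      # callable(request) -> True when complete
        self.completed = []

    def _progress(self):
        # Handle queued requests in order until one cannot complete yet.
        while self.pending and self.transfer(self.pending[0]):
            self.completed.append(self.pending.popleft())

    def isend(self, request):
        self.pending.append(request)
        self._progress()              # as soon as possible, at initiation

    def other_call(self):
        self._progress()              # opportunistic progress in any call

    def wait_all(self):
        while self.pending:           # forced: poll until everything is done
            self._progress()


# Hypothetical link model: a request completes only if a slot is available.
budget = {"slots": 0}


def transfer(request):
    if budget["slots"] > 0:
        budget["slots"] -= 1
        return True
    return False


engine = RelaxedEngine(transfer)
engine.isend("a")
engine.isend("b")                     # nothing can complete yet
budget["slots"] = 1
engine.other_call()                   # "a" completes during another call
after_call = list(engine.completed)
budget["slots"] = 1
engine.wait_all()                     # "b" is forced to completion
print(after_call, engine.completed)
```

No extra thread is needed in this scheme, which is what keeps it cheap when every processor already runs an application process.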

Figure 2 shows the aggregated bandwidth for MPI_Sendrecv() between two 
SMPs measured with PMB (PMB-MPI1 -multi 0 Sendrecv). Relaxed han- 
dling with one process per SMP has a latency of 9.4 µs, while threaded handling 
has 71 µs. The throughput for relaxed handling is higher than for threaded, and 
increases when going from one to two processes per SMP. 

Fig. 2. MPI_Sendrecv performance between two SMPs (threaded and relaxed 
handling). 



4 Algorithmic Adaptations for SMP 

There are two ways to reduce network communication time: reducing network 
traffic and improving network throughput. By rearranging data exchange part- 
ners in collective operations, the network data volume can be reduced compared 
to a naive approach. As illustrated in figure 1, SCI communication throughput 
improves with increasing message length. 

A communication group in MPI is defined by a communicator. Each member 
of a communicator is assigned a unique rank from zero to sizeof(comm)-1. A 
communicator can logically be split into two kinds of sub-communicators: global for 
traffic between SMPs (one or more) and local for SMP internal traffic (one per 
SMP). Collective communication can generally be performed in the following 
three steps. Data from all processes are first redistributed (gathered) SMP in- 
ternally (local). The processes with data then exchange data between the SMPs 
(global). The resulting data are then finally scattered (broadcast) within the 
SMP (local). For certain operations, this approach can reduce the network traf- 
fic by a factor equal to the number of processes on the SMP. The number of 
processes concurrently doing global communication should not exceed the num- 
ber of network adapters attached to the SMP. For the rest of the paper a single 
network adapter is assumed. 
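
To make the three steps concrete, the following Python sketch models a sum-allreduce and counts the messages that cross the network. It is a simplified model with assumptions of its own: the global step is a reduce to one leader followed by a broadcast back to the other leaders, and the comparison is against a flat scheme in which every remote process talks to the root directly. The function names are hypothetical.

```python
def hierarchical_allreduce(values_per_smp):
    """Sum-allreduce in three steps; returns (results, network_messages).

    `values_per_smp[i]` lists the contributions of the processes on SMP i.
    """
    # Step 1 (local): combine the contributions inside each SMP.
    partial = [sum(values) for values in values_per_smp]
    # Step 2 (global): one leader per SMP sends its partial sum to the root
    # leader and receives the total back (assumed pattern).
    leaders = len(values_per_smp)
    network_messages = 2 * (leaders - 1)
    total = sum(partial)
    # Step 3 (local): broadcast the result inside each SMP.
    results = [[total] * len(values) for values in values_per_smp]
    return results, network_messages


def flat_network_messages(values_per_smp, root_smp=0):
    """Network messages when every remote process exchanges with the root."""
    remote = sum(len(v) for i, v in enumerate(values_per_smp) if i != root_smp)
    return 2 * remote


# Four SMPs with two processes each.
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
results, hierarchical = hierarchical_allreduce(data)
flat = flat_network_messages(data)
print(results[0], hierarchical, flat)
```

With two processes per SMP the network message count is halved, matching the reduction factor mentioned in the text for this kind of operation.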

With current PCI based interconnects, one process (processor) can saturate 
the bus when sending. Since the data is extracted from the network at the same 
rate it is injected, the PCI bus on the receiver side will be saturated as well. Three 
concurrent senders to the same adapter (assuming fair network arbitration and 
no packet rejection) will increase the transfer time by 50% compared to a se- 
quential ordering of the senders! Keeping the sequential algorithms and limiting 
the number of concurrent senders by a mutex will not reduce this wasted time. 
Algorithms therefore have to be specially adapted to coordinate network traffic 
to avoid hot receivers/senders, i.e. matching up senders and receivers in a way 
that leaves as few network adapters as possible idle at any time. This can be 
achieved by serializing the receiving in a token-passing approach (passing zero 
byte messages over the collective communicator) between processes on the same 
SMP. By algorithmically adapting sends to match receives from the other SMPs, 
a smooth data exchange will take place. 

Due to different starting times of the collective operation and accumulated
time skew caused by other activity on the SMP, the symmetric data exchange
may get disturbed. In eager transfer mode 0, the messages (typically 512 bytes
to 64 Kbytes) are posted to buffers on the receiver side beyond the receiver's
control. As described earlier, this can result in loss of performance due to
interference from other unrelated transactions. Coordinating the processes to
avoid this effect can be done either globally or per transaction. A global
coordination can be achieved by separating each (or every n'th) transaction step
with a barrier. Performance improvement has earlier been shown with this
approach, but it has two disadvantages: the exchange cannot start until all
participants are ready, and a barrier is usually a costly operation. A per
transaction synchronization can with ScaMPI be done either explicitly or
implicitly (self-synchronizing). Long messages in ScaMPI are only transferred
after a matching receive has been posted (transporter transfer mode jS]), and
are hence self-synchronized. A simple protocol to match senders and receivers is
for the receiver to send a ready-to-receive token to the sender, which waits for
this token before sending, so that a single sender at all times is assured. This
approach is similar to forcing all transfers to use the transporter mechanism.
Since ScaMPI latency over an SCI network is only 6 µs, the performance penalty
of token passing is acceptable. A send-request is a small message in itself, so
synchronizing small messages doesn't reduce concurrency.

4.1 Barrier 



Algorithm                    PPS   2 Proc.   4 Proc.   8 Proc.   16 Proc.   32 Proc.
Linear gather - scatter        1      11.5      26.2      63.2      137.8          -
Linear gather - scatter        2       5.8      20.6      58.4      141.9      323.1
Binomial tree approach         1       8.9      18.2      29.1       38.2          -
Binomial tree approach         2       3.5      20.9      36.0       51.7       76.2
Binomial SMP approach          1       8.6      17.6      29.8       38.7          -
Binomial SMP approach          2       5.9      15.1      26.2       43.6       61.8
SCI shared memory directly     1       7.5       7.6       9.9       17.7          -
SCI shared memory directly     2       1.3       5.6       6.9       10.0       21.0

Table 1. Performance for synchronization over SCI [µs] (PPS = processes per
SMP).



As illustrated in table 1, a barrier can be implemented in several ways. The
simplest is for one process to linearly gather (receive) zero-byte messages from
all others and then scatter (send) a zero-byte message to all others to indicate
that the barrier is complete. The linear approach can be replaced with
hierarchical trees (Binomial), with timing log2(#processes) jS]. Using one process
per SMP to perform the network barrier, with an SMP internal gather (before)
and scatter (after), improves performance further (SMP). For the MPI_COMM_WORLD
communicator, ScaMPI uses SCI shared memory directly, with even better
performance (due to a single SCI shared memory commit of all transfers). Table 1
shows that even short messages can benefit from SMP adaption.
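
The difference between the linear and the binomial barrier can be sketched by
counting message steps rather than microseconds. The following is a minimal
illustration, not ScaMPI code; the function names are hypothetical and the
model assumes that the root handles one zero-byte message at a time.

```python
def linear_barrier_steps(p):
    """Sequential message steps at the root: p-1 receives, then p-1 sends."""
    return 2 * (p - 1)


def binomial_gather_schedule(p):
    """Rounds of a binomial-tree gather towards rank 0.

    In the round with distance `step`, every rank whose lowest set bit equals
    `step` reports to the rank `step` below it. Each round is a list of
    (sender, receiver) pairs that can proceed concurrently.
    """
    rounds = []
    step = 1
    while step < p:
        rounds.append([(r, r - step) for r in range(step, p, 2 * step)])
        step *= 2
    return rounds


def binomial_barrier_steps(p):
    """Gather rounds plus the mirrored scatter rounds."""
    return 2 * len(binomial_gather_schedule(p))


if __name__ == "__main__":
    for p in (2, 4, 8, 16, 32):
        print(p, linear_barrier_steps(p), binomial_barrier_steps(p))
```

The step counts grow linearly in one case and logarithmically in the other,
which is the trend visible in the first four rows of table 1.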

4.2 Allgather 

In MPI_Allgather() all processes send a chunk of data to all others. Aggregated
application throughput is therefore (#processes^2 * chunk size)/time. A balanced
implementation, using MPI_Sendrecv(), sends cyclically to upstream neighbor
processes and receives cyclically from downstream. With multiple processes per
SMP, this approach results in multiple senders to each SMP. However, a simple
variant of this algorithm reduces the active senders and receivers per SMP to
one: one process per SMP gathers all the data from the SMP, and this process
exchanges data with the gathering processes on the other SMPs until all data is
gathered. The resulting data is then broadcast within the SMP, and the operation
is complete.
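
The SMP-adapted variant can be sketched as a small simulation. This is an
illustration under stated assumptions, not the ScaMPI implementation: ranks are
assumed to be numbered consecutively per SMP, the gathering processes are
assumed to exchange blocks along a ring, and `smp_allgather` is a hypothetical
name.

```python
def smp_allgather(chunks, pps):
    """Simulate a three-step allgather with `pps` processes per SMP.

    chunks[r] is the chunk contributed by rank r. Returns the list gathered by
    every rank and the number of messages that crossed the network.
    """
    p = len(chunks)
    assert p % pps == 0, "every SMP is assumed to run the same number of processes"
    n_smp = p // pps

    # Step 1 (local): one process per SMP gathers the chunks of its SMP.
    blocks = [chunks[s * pps:(s + 1) * pps] for s in range(n_smp)]

    # Step 2 (global): the gathering processes forward blocks around a ring.
    have = [{s: blocks[s]} for s in range(n_smp)]
    network_messages = 0
    for step in range(n_smp - 1):
        for s in range(n_smp):
            block_id = (s - step) % n_smp      # block received in the previous step
            assert block_id in have[s]
            have[(s + 1) % n_smp][block_id] = have[s][block_id]
            network_messages += 1

    # Step 3 (local): the complete data is broadcast within each SMP.
    result = []
    for s in range(n_smp):
        full = [c for b in range(n_smp) for c in have[s][b]]
        result.extend(list(full) for _ in range(pps))
    return result, network_messages
```

With 4 SMPs and 2 processes per SMP the sketch uses 12 network messages, each
carrying two chunks, and every SMP has exactly one sender and one receiver per
step.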




Fig. 3. MPI_Allgather performance between 16 SMPs with 1 or 2 proc. per 
SMP 



Figure 3 shows the throughput per SMP for MPI_Allgather() between 16
dual processor SMPs. With one process per SMP, the binomial has, as expected,
better performance than the linear. Due to lack of priority of SCI response
messages in the link-controller Pj, an internal livelock may occur under heavy
traffic. This is detected by timers and resolved by the device driver, but
introduces unproductive timeslots. Running two unsynchronized processes per SMP,
some programs get very exposed to this livelock and hence lose additional
performance.

A positive side effect of gathering all SMP data in one process is that the 
messages over the network increase in size. As shown in figure Q communication 
performance increases with message size. Since SMP internal communication
performance is much higher than network performance, the increased SMP internal
data traffic does not reduce overall performance.

4.3 Alltoall

In all-to-all communication, every process exchanges unique chunks of data with
all others. As for MPI_Allgather(), an intuitive implementation of this is using
MPI_Sendrecv() to cyclically exchange data. With multiple processes per SMP, this
approach results in multiple senders to each SMP. Since unique chunks are
exchanged between all processes, letting one process do all network communication
would result in a lot of extra SMP internal traffic.

A better approach is to coordinate the processes on each SMP by token
passing, to let them take turns in sending and receiving. Every send is paired
up with its matching receive. To ensure that only one remote process is sending
to each SMP, receives are also synchronized, with a ready-to-receive token
from the receiver to the sender. By this approach, at most one process sends
data from, and at most one process sends data to, each SMP at any time.
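
The effect of this ordering can be illustrated with a static schedule. The
sketch below is an assumption made for illustration: it replaces the run-time
token passing by a precomputed sequence of sub-steps, assumes consecutive rank
numbering per SMP, and uses the hypothetical name `ordered_alltoall_schedule`.
SMP internal exchanges are left out.

```python
def ordered_alltoall_schedule(n_smp, pps):
    """Yield sub-steps of an inter-SMP all-to-all exchange.

    Each sub-step is a list of (sender_rank, receiver_rank) transfers in which
    every SMP has exactly one sending and one receiving process.
    """
    for shift in range(1, n_smp):          # SMP s talks to SMP s + shift
        for a in range(pps):               # local index of the sending process
            for b in range(pps):           # local index of the receiving process
                yield [(s * pps + a, ((s + shift) % n_smp) * pps + b)
                       for s in range(n_smp)]


if __name__ == "__main__":
    for substep in ordered_alltoall_schedule(3, 2):
        print(substep)
```

Every pair of processes on different SMPs appears exactly once, so the schedule
needs (n_smp - 1) * pps * pps sub-steps.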

Fig. 4. MPI_Alltoall performance between 16 SMPs.



Figure 4 shows the throughput per SMP for MPI_Alltoall() between 16
SMPs. With one process per SMP the ordered approach outperforms the simple
cyclic MPI_Sendrecv(). As in section 4.2, the cyclic approach using two processes
per SMP runs into bad performance, while the ordered approach maintains
performance. In MPICH 1^ MPI_Alltoall() between N processes is implemented as N
immediate sends and receives, followed by a blocking wait for it all to finish. As
explained earlier, this approach does not perform well with current SCI
implementations, but by forcing communication to use transporter mode 0, as shown
with the MPICH** curve in figure 4, performance can be improved.



5 Conclusion 

The two rules of Kielmann et al. 0 for high latency networks seem to have a
parallel for high bandwidth, low latency networks, e.g. SCI, given by:

– Concurrent senders to and on the same adapter should be avoided.

– If the network is the limiting factor for data movement, gathering SMP
internal data before it is exchanged over the network may be performance effective.

By synchronizing senders and receivers, performance improvements have been
shown for barrier, allgather and alltoall communication.



6 Related Work 

Compared to regular binomial tree based communication, m have shown good 
performance for wide-area network. This work has been focused on SMPs with 
a small number of processes, and basic MPI send & receive has been used to 
broadcast/gather SMP-internal data. For larger number of processes per SMP, 
internal copying can be improved with techniques a.k.a. Using user-space 
to user-space copy without temporary buffering, as described in |3|, will improve
SMP internal data transfer for long messages. Unfortunately this involves a
special patch to be applied to the Linux kernel, which limits the universality of
the approach.



Abstract. We investigate the stability of liquid bridges by using a parallel,
recursive algorithm. The core of the recursion is the Parallel Simplex
Algorithm introduced in nm. We discuss the PVM implementation of the
algorithm and compare our results with earlier published computations.



1 Introduction 

A new approach to the computation of global stability diagrams is illustrated on 
axi-symmetric equilibria of liquid bridges. The computation method is based on 
a recursive scheme, the core of which is a parallel algorithm. In each successive 
step of the recursion this algorithm calls itself, decreasing the dimension of the 
problem by one. The meaning of the trivial, depth-0 recursion is equivalent to 
the computation of the bifurcation diagram, described in ^j- Computation of
stability boundaries requires a depth-1 recursion.

The parallel core algorithm, called the Parallel Simplex Algorithm (PSA) has 
been developed by n,Q, uni with the goal to solve multi-point boundary value 
problems (BVPs) globally, the parallel implementation under PVM is discussed 
in 13 ■ 

Recently, this method has been applied to the equilibria of liquid bridges |0| . 
Also recently, the PSA has been generalized in jO] to a depth-n recursive scheme, 
serving the computation of stability boundaries and other parameter-dependent 
curves. In the current paper we will combine the results of [3 and |3 in order 
to obtain global stability curves for liquid bridges. 

In section 2 we will briefly describe the PSA, section 3 describes the parallel 
implementation, section 4 deals with the recursive scheme. Section 5 summarizes 
the physical background to liquid bridge problems, results are demonstrated in 
section 6. 



J. Dongarra et al. (Eds.): EuroPVM/MPI2000, LNCS 1908, pp. 64-[^ 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000



2 The Parallel Simplex Algorithm

The PSA can be directly applied to two-point Boundary Value Problems (BVPs)
associated with ordinary differential equations (ODEs). Assuming that the latter
is of even order (which is most often the case in mechanics), it is equivalent to
$\dot{x}(t) = f(x(t), \lambda)$, $x \in \mathbb{R}^{2n}$, $\lambda \in \mathbb{R}$,
$t \in [0, 1]$. Let us regroup the equations so that the initial ($t = 0$)
conditions apply to the first $n$ components ($x_i(0) = a_i$, $i = 1, 2, \ldots, n$)
and far-end ($t = 1$) conditions apply to those with indices $\nu_i$
($x_{\nu_i}(1) = b_i$, $i = 1, 2, \ldots, n$), where $a_i, b_i$ are given scalars.
We denote the remaining initial conditions or variables by $v_{i-n} = x_i(0)$,
$i = n+1, n+2, \ldots, 2n$. The $(n+1)$-dimensional space spanned by the variables
$v_i$ and the parameter $\lambda$ is called the Global Representation Space (GRS).
Using any convergent forward integrator for the initial value problem (IVP), we
can compute the final values $x_{\nu_i}(1)$ ($i = 1, 2, \ldots, n$) as functions of
$v_i$ and $\lambda$: $x_{\nu_i}(1) = g_i(v_1, v_2, \ldots, v_n, \lambda)$ and then
solve the algebraic system

$$g_i(v_j, \lambda) - b_i = 0; \quad i, j = 1, 2, \ldots, n, \quad v_j \in [v_j^0, v_j^1], \quad \lambda \in [\lambda^0, \lambda^1] \qquad (1)$$

by the PL algorithm (P) in the prescribed $(n+1)$-dimensional domain of the
GRS (defined by the constants with superscript in (1)). Geometrically, (1)
describes the intersection of $n$ hyper-surfaces in the $(n+1)$-dimensional space,
yielding typically (locally) 1-dimensional solution sets, thus branches. These
branches will appear as polygons, due to the piecewise linear approximation.
(We remark that the variables can have a far more general interpretation in the
PSA; however, the above version is sufficient to introduce the most important
concepts.)

Application of the method can be visualized without technical details. System
(1) can be resolved simultaneously in any sub-domain of the GRS. 'Simultaneous
resolution' stands in relation to 'continuation' as photographic imaging stands to
free-hand sketching. The pixels of a film negative are developed simultaneously
(in parallel) in a chemical bath, whereas the hand-sketch requires a sequence
of strokes with each point in a stroke laid down sequentially. Developing this
analogy further, bifurcation diagrams obtained by continuation are analogous
to hand-sketches where the pencil is not permitted to lift, whereas simultaneous
resolution can deliver families of equilibria that are unconnected (e.g. isolas), as
well. These features make the PSA an optimal candidate for the relatively fast,
global understanding of low-dimensional bifurcation problems.
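
For the smallest case (one variable v and the parameter lambda) the idea of
simultaneous resolution can be sketched as follows. This is a minimal
illustration and not the authors' code: the function name `pl_zero_segments`,
the uniform grid and the splitting of each grid cell into two triangles are
assumptions. Each triangle is treated independently, so disconnected branches
are found without any continuation.

```python
def pl_zero_segments(g, v_range, lam_range, n):
    """Piecewise linear approximation of the zero set of g(v, lam).

    The rectangle v_range x lam_range is divided into n x n cells, each cell
    into two triangles. On every triangle g is replaced by its linear
    interpolant, whose zero set is a straight segment (or empty).
    """
    (v0, v1), (l0, l1) = v_range, lam_range
    hv, hl = (v1 - v0) / n, (l1 - l0) / n
    val = [[g(v0 + i * hv, l0 + j * hl) for j in range(n + 1)]
           for i in range(n + 1)]

    def node(i, j):
        return (v0 + i * hv, l0 + j * hl, val[i][j])

    segments = []
    for i in range(n):
        for j in range(n):
            a, b, c, d = node(i, j), node(i + 1, j), node(i + 1, j + 1), node(i, j + 1)
            for tri in ((a, b, c), (a, c, d)):
                pts = []
                for p, q in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
                    if (p[2] < 0) != (q[2] < 0):          # sign change on this edge
                        t = p[2] / (p[2] - q[2])          # linear interpolation
                        pts.append((p[0] + t * (q[0] - p[0]),
                                    p[1] + t * (q[1] - p[1])))
                if len(pts) == 2:
                    segments.append((pts[0], pts[1]))
    return segments
```

Applied to g(v, lambda) = v^2 + lambda^2 - 1, the returned segments form a
polygon close to the unit circle, which corresponds to the polygonal branches
mentioned above.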

3 Implementing the PSA under PVM 

A simple BVP involves a 3-5 dimensional GRS; however, the complexity of the
problem grows exponentially with the number of dimensions. In order to solve
the equation system with prescribed precision we have to choose a sufficiently
small grid size for the PL algorithm. Supposing that the number of points on
each coordinate axis is $N$ and the number of dimensions is $n$, the number of
points where we have to use the forward integrator will be $N^n$. Moreover we have
to solve $N^n n!$ equations. This means that the CPU and memory requirements
of the algorithm grow exponentially with the number of dimensions 0.

Considering that the GRS can be divided into smaller domains, and that the
computation in every domain is independent, domain partitioning is a natural
basis for the parallelization.

The implemented parallel PVM program is based on a master-slave structure,
where the master program divides the phase space into smaller pieces
(domains) and the slaves solve the equation system in these domains. Since
the computational effort inside the domains is by orders of magnitude larger than
the effort for communication (i.e. the computation/communication ratio is very
large), the speedup is almost linear in the number of processors, which also
defines the scalability of the software. This has been tested in the range of
2-120 processors in different hardware and software environments. The major
functions of the master program are: reading the configuration files, creating the
domain list, starting and stopping the slaves, collecting the results from slaves,
load-balancing, and doing checkpoint restart.

Files are handled only by the master program, thus the slaves can run on
any network connected machine, and NFS is not required. The slave program
essentially contains the serial version of the described Simplex Algorithm and
solves the equations in the domain given by the master.

The load-balancing is provided by the master, because the GRS is divided
into more domains than the number of processors. When the computation in a
domain has been finished, the master assigns the next domain to the next free
slave. In this way faster processors will get more jobs than slower ones.
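
The load-balancing rule can be illustrated with a small simulation. It is a
sketch under assumptions and involves no PVM calls: `domain_costs`,
`slave_speeds` and `simulate_master` are hypothetical names, and communication
time is neglected because the computation/communication ratio is stated to be
very large.

```python
import heapq


def simulate_master(domain_costs, slave_speeds):
    """Assign each domain to the slave that becomes free first.

    domain_costs: work per domain (arbitrary units).
    slave_speeds: work units a slave processes per unit time.
    Returns (makespan, number of domains handled by each slave).
    """
    free_at = [(0.0, s) for s in range(len(slave_speeds))]
    heapq.heapify(free_at)
    jobs = [0] * len(slave_speeds)
    for cost in domain_costs:
        t, s = heapq.heappop(free_at)          # next free slave
        jobs[s] += 1
        heapq.heappush(free_at, (t + cost / slave_speeds[s], s))
    makespan = max(t for t, _ in free_at)
    return makespan, jobs


if __name__ == "__main__":
    # 30 equal domains, the second slave is twice as fast as the first
    print(simulate_master([1.0] * 30, [1.0, 2.0]))
```

In this example the faster slave ends up with twice as many domains, so both
slaves finish at the same time.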



4 Recursive Version of the PSA 

The original equation (1) may contain parameters $C_i$ besides the variables $v_i$
and the parameter $\lambda$. In this case solution sets emerge as multi-dimensional
manifolds rather than 1D lines. In many applications special, 1D subsets of these
multidimensional manifolds are of real interest. We embedded the PSA into a
recursive scheme, capable of computing these special lines directly.

The simplest case of such a recursion (depth-1) is when there is one parameter:
$C_1$. In this case one might ask how the extremal points on the solution
branches (1D lines) of the original problem vary as we change $C_1$. The "dumb"
approach to this question would be to let the PSA solve the original problem for
many values of $C_1$, select extrema on all diagrams and then connect them. The
recursive approach is rather different.

It regards an extended, $n + 1$ dimensional problem. In order to isolate 1D
branches, one needs one additional function/constraint: this is delivered by the
condition that we are looking for extrema. While the first $n - 1$ functions and
their evaluation remain identical to the original problem, the evaluation of the
last, added function (extremal condition) requires the solution of an
$n$-dimensional problem (in a very small domain). This concept can be generalized
to recursion of arbitrary depth (see |:3)); however, for the current application
the depth-1 recursion is sufficient.

We have modified our PSA implementation for this requirement, resulting in
a recursive algorithm, which is a substantial generalization of the original PSA.
The key idea is simple (described here for depth-1 recursion):

– enlarge our equation system (1) with a new function which yields the required
properties (e.g. turning points, local extrema) at the roots of the original
equation system,

– set $C_1$ as a new variable,

– solve this enlarged equation system using the PSA, described in section 2.

We remark that the new function may contain virtually any condition on the
solution branch of the original, $n$-dimensional problem, including higher
derivatives, singularities, etc. We emphasize that we did not restrict the
dimension or recursion depth of the algorithm, so our code is capable of solving a
large variety of problems emerging in applications.



5 Liquid Bridges and Stability Boundaries 

Liquid equilibria problems are low dimensional; the GRS is only 2-dimensional 
for a wide range of physical situations. Scientific interest in figures of equilib- 
rium can be traced back to the time of PlateauPUj. Mathematicians have been 
stimulated by the minimal surface problem and extensions thereof m- Physical 
chemists have made early computations of shapes and families of shapes. Mo- 
tivation has ranged from improving measurement devices where a meniscus is in- 
volved to measuring surface tension using droplet and bubble methods |l III ;-ij . 
Recent interest from the engineering community has focussed on materials |14lib| 
and micro-gravity applications HHiini. The common feature here is that liquid 
shapes are dominated by surface tension (small capillary length). In these pa- 
pers, by concentrating on different physical aspects (such as effects of gravity, 
asymmetric boundary conditions, etc.) unifying features easily recognized in the 
setting of the GRS have been obscured. Our present goal is to show that the 
PSA can not only utilize parallel computing resources very efficiently in order 
to solve such problems but also can help to provide global stability diagrams. 

Static shapes of surfaces that contain a liquid are governed by the normal
stress balance across the surface, called the Young-Laplace equation. In the
absence of gravity, this equation can be reduced to the 2-dimensional system

$$\alpha'(s) = p - \sin(\alpha(s))/r(s), \qquad r'(s) = \cos(\alpha(s)), \qquad (2)$$

where $r$ measures the radius from the axis of symmetry, $p$ is the pressure in the
liquid, $\alpha$ is the (counterclockwise positive) tangent angle of the meridian
with respect to the $r$ axis, and arclength $s$ is the independent variable
($' = d/ds$). The equation $z'(s) = \sin(\alpha(s))$, yielding the vertical $z$
coordinate, is decoupled from (2). We considered pinned boundary conditions (cf.
figure 1 with $r_0 = 1$)

$$r(0) = r|_{z=L} = 1. \qquad (3)$$

Fig. 1. Definition sketches for axisymmetric figures of equilibrium; (a) IVP:
space-curve geometry or 'kinematics'; (b) BVP: liquid bridge

In order to apply the PSA to this BVP, we have to establish the global
coordinates spanning the GRS. As described in Section 2, these coordinates
(variables) consist of non-specified initial values and parameters. In our case,
the only non-specified initial value is $\alpha(0)$ which, with the parameter $p$,
spans the 2-dimensional GRS. Using the coordinates $(\alpha(0), p)$ the physical
$rz$ shape can be uniquely reconstructed by forward integration of (2) and the
third equation given afterwards. Adopting the general notation of the
Introduction, we have $n = 1$; $x_1 = r$; $x_2 = \alpha$, $a_1 = 1$, $\nu_1 = 2$,
$v_1 = \alpha(0)$, $\lambda = p$. We are seeking zeroes
of the function f{ao,p) = sin(a) — L = z — L, defining the global bifurcation 
diagram. Note that L is constant for each separate bifurcation diagram. 
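
The forward integration on which each function evaluation rests can be sketched
as below. This is an illustration under assumptions: the classical fourth-order
Runge-Kutta scheme stands in for "any convergent forward integrator", and the
name `integrate_bridge` and its parameters are hypothetical. The integrated
system is alpha' = p - sin(alpha)/r, r' = cos(alpha), z' = sin(alpha), started
at r = 1, z = 0.

```python
import math


def integrate_bridge(alpha0, p, s_end, steps=1000, r0=1.0):
    """Integrate the meridian curve from s = 0 to s = s_end with RK4.

    Returns (alpha, r, z) at s = s_end for the initial angle alpha0 and the
    pressure p, i.e. for one point (alpha0, p) of the GRS.
    """
    def rhs(state):
        alpha, r, _z = state
        return (p - math.sin(alpha) / r, math.cos(alpha), math.sin(alpha))

    def shifted(state, k, factor):
        return tuple(s + factor * ki for s, ki in zip(state, k))

    h = s_end / steps
    state = (alpha0, r0, 0.0)
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs(shifted(state, k1, 0.5 * h))
        k3 = rhs(shifted(state, k2, 0.5 * h))
        k4 = rhs(shifted(state, k3, h))
        state = tuple(s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
    return state


if __name__ == "__main__":
    # alpha0 = pi/2 and p = 1 give the cylinder r = 1, a simple consistency check
    print(integrate_bridge(math.pi / 2, 1.0, 2.0))
```

For alpha(0) = pi/2 and p = 1 the right-hand side of the first equation
vanishes, so the integration reproduces the cylindrical bridge with r = 1 and
z = s.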

The stability of equilibria can be investigated via the extremal points of the
bifurcation diagram. If we look for a family of such curves,
parametrized by $L$, we obtain surfaces in the $[\alpha_0, p, L]$ space. We are
interested in points where the partial derivative vanishes. (We remark that extrema
of the volume are also of interest and can be obtained by a similar procedure.)
Using the terminology of section 4, we have the additional parameter $C_1 = L$
and the additional constraint of the vanishing partial derivative. Added to the
original problem we now have a 2D PSA problem, one function of which requires
the computation of a 1D PSA (in a small domain). This setting enables us to do a
global search for stability boundaries in the $[\alpha_0, p, L]$ extended GRS.



6 Results and Conclusions

We studied the stability of axi-symmetric liquid bridges in the following domain
of the GRS $[\alpha_0, p, L]$: $-1.0 < \alpha(0) < 4.2$; $-1.5 < p < 3.5$;
$0.2 < L < 13$. We tested our code on two different hardware environments:

(1) the IBM SP2 supercomputer machine at the Cornell Theory Center (CTC) 
and 

(2) the Intel PIII-based cluster at the Technical University of Budapest, Centre
of Information Technology.

In both cases we used "primary" grids with 500 x 500 x 500 subdivisions and
refined the results on a secondary grid of triple density. Computation results
have been filtered based on errors registered in the two function values, stored 
together with the three coordinates of the solution points. On both platforms we 
achieved almost linear speedup factors due to the low communication between 
the nodes. 

We illustrate our results by comparing the result of a typical, 5-hour, 50- 
processor run with the earlier published data of m- As customary, the graphs 
display the normalized volume versus the length L. Curves ACA and HCJ on 
the plots, identified by the labels on fig. 2a, should be compared. These are 
the stability limits to constant pressure disturbances. Curve ACA in fig 2b is 
cut off for L i 2 because the corresponding extrema in the bifurcation diagrams 
fall outside the computational window (p and α limits, see above). Note
that while our computations show several gaps (due to inadequate mesh size and
dense filters), they illustrate the global behavior in accordance with |25 and also
predict interesting behavior at L ≈ 9. The illustrated diagram of IZH has been
obtained by a semi-automated procedure, based on the computation of individual 
bifurcation diagrams and connecting their extrema. Besides providing a fully 
automated algorithm, our method also yields access to disconnected stability 
curves if they exist.

Fig. 2. Global stability chart for liquid bridges (a) as computed in |2I3, (b)
current results



Abstract. We report on experiments with graph algorithms which were designed
for the coarse grained multicomputer (CGM) model. The implementation was
based on the message-passing paradigm and uses PVM and/or MPI as
communication interfaces.



1 Introduction and Overview 

Parallel graph algorithms have seen a rich development since the very beginning
of parallel computation. But while there are many theoretical studies in this area,
relatively few implementations have been presented for the algorithms that were
designed. Moreover, most of these implementations have been carried out on specific
parallel machines (C90, T3E, CM2, CM5, MasPar, Paragon) using special purpose
software (Paris, CMIS, NESL, MPL). As far as we know, few implementations use PVM
or MPI and they only concern the minimum spanning tree and shortest paths problems.
For a state of the art on the implementations of parallel graph algorithms, see |L(^.

Two questions coming from a wider framework were the starting point of our work: 

1) Which parallel model allows the development of algorithms that are: feasible
(they can be implemented with a reasonable effort), portable (the code can be used
on different platforms without rewriting it), predictable (the theoretical analysis
allows the prediction of the behavior on real platforms) and efficient (the code
runs correctly and is more efficient than the sequential code)?

2) What are the possibilities and limits of graph handling on real parallel platforms?

The portability aspect will obviously lead to results less efficient than the use of
code fine-tuned for the structure of each machine, but the idea is to obtain results with
an acceptable efficiency on each machine by using the same code for a given problem.
Since most current parallel machines or networks are distributed memory machines,
efficient algorithms using the message-passing paradigm implemented with portable
tools like PVM/MPI should lead to such a compromise.

To give the first answers to these questions, we worked on different algorithmic
problems for graphs. For each tackled problem, either we chose an existing algorithm or
we proposed a new one if no algorithm had been designed or if the existing algorithm(s)
did not seem adapted to the goals we stated previously. Then, we implemented these
algorithms.

In this article we report on our experience with four problems: for each
problem, we give a quick state of the art on the algorithms and their implementation
(if existing) and a brief description of the implemented algorithm with
its theoretical complexity in time. Then, we present the results obtained from the
implementations on PC clusters using PVM/MPI with the main analysis. Although each
presentation may appear succinct, we preferred to give a survey of our work and an
idea of what can be done on graphs with tools like PVM/MPI. For the reader
interested in more detailed studies on the subject, see |TO). Section 2 briefly
presents the Coarse Grained Multicomputer (CGM) model, which seems well adapted for
computations according to the message-passing paradigm, and the experimental
framework. Section 3 gives the results obtained for one of the basic problems,
sorting. Section 4 deals with the difficult problem of list ranking. Section 5
shows that it is possible to solve the connected components problem on dense graphs
efficiently. Section 6 shows that an algorithm with log p supersteps (p is the
number of processors) can be efficient in practice. We give the first answers to
the initial questions in Section 7.



2 The Parallel Model and the Implementation Background 

The CGM Model Recently, several works tried to provide models that take realistic 
characteristics of existing platforms into account while covering at the same time as 
many parallel platforms as possible. Proposed by Valiant , llTCI . BSP (Bulk Synchronous 
Parallel) is the originating source of this family of models. It formalizes the architectural 
features of existing platforms in very few parameters. The LogP model proposed by 
Culler et al., im considers more architectural details compared to BSP, whereas the
CGM model initiated by Dehne et al., |0], is a simplification of BSP. We chose CGM
because it has a high abstraction that easily enables the design of algorithms and offered 
the simplest realization of the goals we had in mind. 

The three models of parallel computation have a common machine model: a set 
of processors that are interconnected by a network. A processor can be a monoprocessor
machine, a processor of a multiprocessor machine or a multiprocessor machine.
The network can be any communication medium between the processors (bus, shared 
memory, Ethernet, etc). 

The CGM model describes the number of data per processor explicitly: for a
problem of size $n$, it assumes that the processors can hold $O(n/p)$ data in their
local memory and that $n/p \gg 1$. Usually the latter requirement is put in concrete
terms by assuming that $p \le n/p$, because each processor has to store information
about the other processors.

The algorithms are an alternation of supersteps. In a superstep, a processor can send
or receive once to and from each other processor and the amount of data exchanged in a
superstep by one processor in total is at most $O(n/p)$. Unlike BSP, the supersteps are
not assumed to be synchronized explicitly. Such a synchronization is done implicitly
during the communication steps. In CGM we have to ensure that the number $R$ of
supersteps is particularly small compared to the size of the input. For instance, we
can ensure that $R$ is a function that only depends on $p$ (and not on $n$, the size
of the input).

The Background We have implemented these algorithms on two PC clusters. Note that
our code also ran on distributed memory parallel machines. The first cluster, called PF
henceforth, consists of 13 PentiumPro 200 PCs with 128 MB memory each. The PCs are
interconnected by a 100 Mb/s full-duplex Fast Ethernet network. The second cluster,
called POPC, consists of 12 PentiumPro 200 PCs with 64 MB of memory each. The
interconnection network is a Myrinet network. The programming language is C++
(gcc) and the communication libraries are PVM and MPI. We began to use PVM release
3.4.2, but our code now runs with LAM/MPI 6.3 and MPI-BIP (developed for the
Myrinet network). All the results presented in this article were obtained with PVM
except for the results of the list ranking, which were obtained with MPI (we explain
this fact in Section 4). Note that this article does not intend to compare the
performances of PVM and of MPI via the graph algorithms but to show the kind of
results we can obtain on graphs with the use of message-passing tools like PVM/MPI.

All the tests have been carried out ten times for each input size. The results given 
are an average of ten tests. All the execution times are in seconds. Each execution 
time is taken as the maximum value of the execution times obtained on each of the p 
processors. In all the given figures, the x-axis corresponds to n, the input size and the 
y-axis gives the execution time in seconds per element. Both scales are logarithmic to
make the curves readable. To test and instrument our code we generated input objects 
randomly. Due to space limitations, we omit this description. See |23l for more details. 
The time required for the generation of an object is not included in the times as they are 
presented. 



3 A Basic Operation: Sorting 

The choice of the sorting algorithm is a critical point due to its widespread use to solve
graph problems. In BSP, there are deterministic as well as randomized algorithms. In
the CGM setting, all these algorithms translate to a constant number of supersteps.
The algorithm proposed by Goodrich, []^]], is theoretically the most efficient,
but is complicated to implement and quite greedy in its use of memory. We chose the
algorithm of [j5| because it is conceptually simple and requires only 3 supersteps. It is
based on the sample technique which uses $p - 1$ splitters to cut the input elements in
p packets. The choice of the splitters is then essential to ensure that the packets have
more or less the same size. The algorithm is randomized and bounds the packets size 

^ vfcn) probability only. 

In our implementation, the sorted integers are the standard 32 bit int types of the
machines. We use counting sort, |21, as the sequential sorting subroutine. To
distribute the data according to the splitters, we do a dichotomic search on the
$p - 1$ splitters to find the destination packet of each element. By that we only
introduce a $\log p$ factor. Therefore, this sort can be solved with probability
$1 - o(1)$ in $O(T_s(n/p) + (n/p) \lceil \log(p - 1) \rceil)$ local computations per
processor, $O(n)$ for the total communication cost and 3 supersteps. $T_s$ is the
complexity of the sequential sort.

* http://www.inria.fr/sophia/parallel
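The splitter-based distribution described above can be sketched in a single process.
This is only an illustration under stated assumptions: the p processors are simulated
by p Python lists, the built-in sort stands in for counting sort, and the function
name `sample_sort`, the `oversample` factor and the way splitters are drawn from a
random sample are hypothetical choices, not the authors' implementation.

```python
import bisect
import random

def sample_sort(data, p, oversample=8, seed=0):
    """Simulate a p-processor sample sort in one process (illustrative only)."""
    if p == 1 or len(data) < p:
        return sorted(data)
    rng = random.Random(seed)
    # Draw a random sample and keep p - 1 evenly spaced splitters from it.
    sample = sorted(rng.sample(data, min(len(data), p * oversample)))
    step = len(sample) / p
    splitters = [sample[int(k * step)] for k in range(1, p)]
    # Dichotomic (binary) search over the p - 1 splitters gives the
    # destination packet of each element in O(log p) comparisons.
    packets = [[] for _ in range(p)]
    for x in data:
        packets[bisect.bisect_right(splitters, x)].append(x)
    # Each simulated processor sorts its own packet; concatenating the
    # packets in order yields the globally sorted sequence.
    out = []
    for packet in packets:
        out.extend(sorted(packet))
    return out
```

In a message-passing version, filling `packets` would correspond to the
all-to-all exchange, and the final loop to the last local computation phase.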




(a) POPC (b) PF 

Fig. 1. Sorting 

Figure 1 gives the execution times per element of the program with 1, 2, 4 and 8
PCs for POPC, and with 1, 2, 4, 8 and 12 PCs for PF. The right ends of the curves
show the swapping effects. Measurements begin at one million elements to satisfy
some inequalities required by this sort. The memory of an individual PC in PF is
twice as large as that of a PC in POPC, therefore PF can sort twice as much data.
As expected, we see that the curves (apart from swapping) are nearly constant in n
and that the execution times clearly improve when we use 2, 4, 8 or 12 PCs. We see
that this parallel sort can handle very large data efficiently, whereas the
sequential algorithm gets stuck quite early due to the swapping effects. Note that
PF can sort 76 million integers with 12 PCs in less than 40 seconds.

4 The List Ranking Problem 

The list ranking problem frequently occurs in parallel algorithms that use dynamic ob- 
jects like lists, trees or graphs. The problem is the following: given a linked list of 
elements, for each element x we want to know the distance from x to the tail of the list. 
Whereas it is easy to solve sequentially, it seems much more difficult in parallel. The 
first proposed algorithms were formulated in the PRAM model. In the coarse grained 
models, several algorithms were also proposed, but none of them is communication 
optimal, see Q and ID- As far as we know, few implementations have been realized, 
and none of them seems to be portable because they are highly optimized for the target 
machine, see OUl. 

We proposed a randomized algorithm that uses the technique of independent sets,
as described in m. It requires $O(\log p)$ supersteps, $O(n/p)$ for local
computations per processor and $O(n)$ for the total communication cost (see @ for a
detailed analysis).
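To make the problem statement concrete, here is a sequential baseline for list
ranking. It is a minimal sketch, assuming the list is stored as a successor array
in which -1 marks the tail; the name `list_rank` and this representation are
assumptions for illustration. It is not the randomized independent-set algorithm
discussed above, only the easy sequential case that the parallel versions compete
with.

```python
def list_rank(succ):
    """Return rank[x] = number of links from element x to the tail of the list."""
    n = len(succ)
    if n == 0:
        return []
    # The head is the only element that is nobody's successor.
    has_pred = [False] * n
    for s in succ:
        if s != -1:
            has_pred[s] = True
    head = has_pred.index(False)
    # Walk once from head to tail, recording the visiting order.
    order = []
    x = head
    while x != -1:
        order.append(x)
        x = succ[x]
    # The element visited k-th is n - 1 - k links away from the tail.
    rank = [0] * n
    for k, x in enumerate(order):
        rank[x] = n - 1 - k
    return rank
```

The walk follows pointers in list order, not memory order, which hints at why the
sequential version degrades so sharply once the data no longer fits in memory.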
To avoid overloading the discussion of the results, we only present the experiments
on POPC; the results on PF are similar. Figure 2 gives the execution times per
element as a function of the list size. p varies from 4 to 12, because the memory
of the processors is saturated when we use 2 or 3 PCs. All the curves stop before
the memory of the processors is saturated. We start the measurements for lists with
1 million elements, because for smaller sizes the sequential algorithm performs so
well that using more processors is not very useful.

Fig. 2. List Ranking on POPC

A positive fact that we can deduce from the plots given in Figure 2 is that the
execution time for a fixed number of processors p shows a linear behavior, as
expected. One might get the impression from Figure 2 that it deviates a bit from
linearity in n, but this is only a scaling effect: the variation between the values
for a fixed p and varying n is very small (less than 1 µs). We see that from 9 PCs
on, the parallel algorithm becomes faster than the sequential one. The parallel
execution time also decreases with the number of PCs used. Nevertheless, the
speedups are quite limited (these results are the best we obtained with MPI and are
better than those obtained with PVM). We also see that this algorithm performs well
on huge lists. Due to the swapping effects, the sequential algorithm dramatically
changes its behavior when run with more than 4 million elements. For 5 million
elements, the execution time is slightly higher than 3000 seconds, whereas 12 PCs
solve the problem in 9.24 seconds. We also see that we only need 18 seconds to
handle lists with 17 million elements.




5 Connected Components for Dense Graphs 



Searching for the connected components of a graph is also a basic graph operation.
For a review of the different PRAM algorithms on the subject see O-. Few algorithms
for the coarse grained models have been proposed. In jU, the first deterministic CGM
algorithm is presented. It requires $O(\log p)$ supersteps and is based on PRAM
simulations and list ranking. According to our experience, the simulation of PRAM
algorithms is complex to implement, computationally expensive in practice and
hardly predictable. Moreover, this algorithm uses list ranking, which is a really
challenging problem, as shown previously. On the other hand, the part of this
algorithm that is specific to CGM does not have these constraints. It computes the
connected components for graphs where $n < m/p$, that is to say for graphs that are
relatively dense, and does this without the use of list ranking. Therefore we
implemented this part of the algorithm. It computes the connected components of a
graph with n vertices and m edges such that $n < m/p$ in $\lceil \log p \rceil$
supersteps, $O(m/p + \lceil \log p \rceil n)$ local computations per processor and
$O(\lceil \log p \rceil n)$ for the total communication cost. Each of the p
processors requires a memory of $O(m/p)$.

(a) n = 1000 (b) n = 10000

Fig. 3. Connected components on PF



We use multi-graphs where two vertices are chosen randomly to form a new edge
of the graph. The use of multi-graphs for these tests is not a drawback because the
algorithm touches each edge only once, unless it belongs to the spanning tree. For
this problem, there are two parameters, n and m, to vary. As the code has the same
behavior on both clusters, we only show the results for graphs with 1000 and 10000
vertices on PF. Figure 3 gives the execution times in seconds per item with 1, 2, 8
and 12 PCs. For n = 1000, m ranges from 10000 to 500000. For n = 10000, m ranges
from 10000 to 36 million. We see that for a fixed p the curves decrease with m. If
we study the results obtained with n = 1000 more precisely, we see that when the
graph has more than 50000 edges there is always a speedup compared to the
sequential implementation, and the more processors we use, the faster the
execution. With n = 10000, we can make the same remark when the graph has more than
1 million edges. Note that with n = 10000 it is possible to handle very large
graphs by using several PCs. Sequentially, the PC begins to swap at about 3.2
million edges, whereas the connected component computation on a graph with 36
million edges can be solved in 2.5 seconds with 12 PCs.
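The test inputs and a sequential reference result can be reproduced with a short
sketch. It assumes a union-find structure for the sequential baseline, which is a
standard technique swapped in here for illustration and is not the CGM algorithm
measured above; the names `random_multigraph` and `connected_components` are
hypothetical.

```python
import random

def random_multigraph(n, m, seed=0):
    """Pick two vertices at random for each of the m edges (repeats allowed)."""
    rng = random.Random(seed)
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(m)]

def connected_components(n, edges):
    """Label every vertex with a representative of its component (union-find)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # this edge joins two components
    return [find(v) for v in range(n)]
```

Two vertices are in the same component exactly when they receive the same label,
so the labels can be compared against the output of a parallel run.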

6 Permutation Graphs 

The permutation graph associated with a permutation $\Pi$ is the undirected graph
G = (V, E) where $\{i, j\} \in E$ if and only if $i < j$ and $\Pi(i) > \Pi(j)$.
Permutation graphs are combinatorial objects that have been intensively studied.
Basic references may be found in Q. This graph problem can also be translated into a
computational geometry problem called the dominance problem, which arises in many
applications like range searching, finding maximal elements and interval/rectangle
intersection problems (lO). Passing from the permutation to the graph and vice
versa is done easily in a sequential time of $O(n^2)$. In parallel, we show how to
pass from the permutation to the graph in the PRAM and their approach easily
translates to CGM. This leads to a new compact representation of permutation graphs
t il jll V. The main step of this algorithm is to compute the number of
transpositions for each value $i = 0, \ldots, n - 1$, i.e. the cardinality of
$\{ j \mid i < j \text{ and } \Pi(i) > \Pi(j) \}$. It requires exactly
$\lceil \log_2 p \rceil$ supersteps in addition to the local computations on each
processor. The overall communication is in $O(n \lceil \log_2 p \rceil)$ and is
thus smaller than the local computation cost.
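The quantity computed in the main step can be illustrated sequentially. The sketch
below assumes the permutation is given as a Python list over 0..n-1 and uses a
Fenwick (binary indexed) tree, a standard technique chosen here for illustration;
it is not the CGM algorithm of the paper, and the name `transposition_counts` is
hypothetical.

```python
def transposition_counts(perm):
    """counts[i] = |{ j : i < j and perm[i] > perm[j] }| for a permutation of 0..n-1."""
    n = len(perm)
    tree = [0] * (n + 1)  # Fenwick tree indexed by value + 1
    counts = [0] * n
    # Scan from right to left so the tree holds exactly the values at positions j > i.
    for i in range(n - 1, -1, -1):
        # Prefix query: how many stored values are smaller than perm[i]?
        k = perm[i]
        total = 0
        while k > 0:
            total += tree[k]
            k -= k & -k
        counts[i] = total
        # Insert perm[i] into the tree.
        k = perm[i] + 1
        while k <= n:
            tree[k] += 1
            k += k & -k
    return counts
```

The sum of all counts is the number of edges of the permutation graph, and
`counts[i]` is the number of neighbors j > i of vertex i.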




(a) POPC (b) PF 

Fig. 4. A permutation graph algorithm 

To simplify the implementation and without loss of generality, we assume that p is a
power of 2. The generated inputs are random permutations. The elements are unsigned
long integers. Figure 4 shows the execution times in seconds per element for 1, 4
and 8 PCs. For PF, the size of the permutation ranges from 100000 to 16 million,
whereas for POPC it ranges from 100000 to 8 million (due to the memory size of the
PCs on each cluster). The right ends of the curves show the beginning of the
swapping effects. First, the curves have the expected behavior: they are constant
in n. The execution time is also lower when we use more processors, as expected.
Again, it is possible to solve this problem on very large data. For PF, one PC
begins to swap after 4 million elements, whereas 8 PCs do so at slightly less than
17 million elements. Note that the local computation time is greater than the
communication time, as expected.

7 Answers to the Initial Questions 

This work allows us to give some partial answers to the questions we asked at the 
beginning of this paper. 

Question 1) Given the analysis of the results, it seems that coarse grained models
are very promising for obtaining feasible, portable, predictable and efficient
algorithms and code. Although a lot of work remains, these first steps go towards
practical and efficient parallel computation. These results are in fact a
combination of at least three points: a) these models allow the design of efficient
algorithms based on the message-passing paradigm, b) PVM/MPI provide the main
routines to efficiently implement the communication steps needed in the coarse
grained algorithms and c) portability is ensured on the one hand by the portability
of software like PVM/MPI and C++ used for our experiments and on the other hand by
the general structure of the coarse grained algorithms.

Question 2) This work shows that it is now possible to write portable code for
graph handling that handles very large data, and that for some problems this code
is efficient. The most challenging problem from the point of view of feasibility
and efficiency that we encountered is the list ranking problem. It is possible that
this singularity comes from the specific irregular structure of the problem.
Nevertheless, it seems obvious that these results can have an impact on many
parallel graph algorithms using the message-passing paradigm that are based on list
ranking.

Abstract. Adaptive multigrid methods solve partial differential equa-
tions through a discrete representation of the domain that introduces
more points in those zones where the equation behavior is highly irreg-
ular. The distribution of the points changes at run time in a way that
cannot be foreseen in advance. We propose a methodology to develop
a highly parallel solution based upon a load balancing strategy that re-
spects the locality property of adaptive multigrid methods, where the
value of a point p depends on the points that are "close" to p according
to a neighborhood stencil. We also describe the update of the mapping
at run time to recover from an unbalance, together with strategies to
acquire data mapped onto other processing nodes. An MPI implementation
is presented together with some experimental results.



1 Introduction 

Multigrid methods are iterative methods based upon multilevel paradigms to 
solve partial differential equations in two or more dimensions. Combined with 
the most common discretization techniques, they are among the fastest and most 
general methods to solve partial differential equations |S[7|. Moreover, they do 
not require particular properties of the equation, such as the symmetry or the 
separability and are applied to problems in distinct scientific fields Eiiaiia. 

The adaptive version of multigrid methods, AMM, discretizes the domain 
at run time by increasing the number of the points in those zones where the 
behavior of the equation is highly irregular. Hence, the distribution of the points 
in the domain is not uniform and not foreseeable. 

Since the domain usually includes a large number of points, the adoption
of a parallel architecture is mandatory. We have defined in P0IBI a paral-
lelization methodology to develop applications to solve irregular problems on
distributed memory parallel architectures. This paper describes the application
of this methodology to develop an MPI implementation of the AMM. Sect. 2
describes the main features of AMM, Sect. 3 shows the MPI implementation
resulting from applying our data mapping technique to the AMM. Sect. 4
describes the technique to gather information mapped onto other processing nodes
and the problems posed by the adoption of MPI collective communications. The
experimental results on a Cray T3E are discussed in Sect. 5.

* This work has been partially supported by CINECA 

2 Adaptive Multigrid Methods

An AMM discretizes the domain through a hierarchy of grids built during the
computation, according to the considered equation. In the following, we adopt
the finite difference discretization method. For the sake of simplicity, we assume
that the domain belongs to a space with two dimensions and each grid partitions the
domain, or some parts of it, into a set of squares. The values of the equation
are computed in the corners of each square. We denote by $g(A, l)$ the grid that
discretizes a subdomain A at level l. To improve the accuracy of the discretization
provided by $g(A, l)$, a finer grid, $g(A, l+1)$, obtained by halving the sides
of each square of $g(A, l)$, is introduced. In this way, at run time, finer and
finer grids are added until the desired accuracy has been reached. Even if, in
practice, the first k levels of the hierarchy are built in advance, to simplify the
description of our methodology we assume that the initial grid is one square, i.e.
k = 0.

The AMM iteratively applies a set of operators on each grid in a predefined or-
der, the V-cycle, until the solution has been computed. The V-cycle includes two
phases: a descending one, that considers the grids from the highest level to the
lowest one, and an ascending one, that considers the grids in the reverse order.
Two versions of the V-cycle exist: the additive and the multiplicative; we adopt
the additive one and briefly describe the involved operators 0. The smoothing
operator usually consists of some iterations of either the Gauss-Seidel method
or the Jacobi one to improve the current solution on each grid. The restriction
operator maps the current solution on $g(A, l)$ onto $g(A, l-1)$. The value of each
point on $g(A, l-1)$ is a weighted average of the values of its neighbors on
$g(A, l)$. The prolongation operator maps the current solution on $g(A, l)$ onto
$g(A, l+1)$. If a point exists on both grids, its value is copied. The value of any
other point of $g(A, l+1)$ is an interpolation of the values of its neighbors on
$g(A, l)$. The norm operator evaluates the error of the current solution on each
square that has not been further partitioned. The refinement operator, if applied to
$g(A, l)$, adds a new grid $g(A, l+1)$.

Our methodology represents the grid hierarchy through a quad-tree, the H-
Tree. A quad-tree is well suited to represent the hierarchical relations among
the squares and it is intrinsically adaptive. Each node N at level l of the H-Tree,
hnode, represents a square, $sq(N)$, of a grid $g(A, l)$ of the hierarchy. The
squares associated with the sons of N, if they exist, represent $g(sq(N), l+1)$.
Because of the irregularity of the grid hierarchy, the shape of the H-Tree is
irregular too. The quad-tree has been adopted in |S|, while alternative
representations of the grid hierarchy have been adopted in mini. The multigrid
operators are applied to $g(A, l)$ by visiting all the hnodes at level l of the
H-Tree. All the operators are applied to $g(A, l)$ before passing to $g(A, l+1)$ or
$g(A, l-1)$.



3 Data Mapping and Load Balancing 

This section describes the load balancing strategies that, respectively, map each
square at any level of the hierarchy onto a processing node, p-node, and update
the mapping during the computation to recover from an unbalance. Both strategies
take into account two locality properties of an AMM: the value of a point on
$g(A, l)$ is a function of the values of its neighbors i) on the same grid for
operators such as smoothing and norm (intra-grid or horizontal locality); ii) on
$g(A, l+1)$ (if it exists) and $g(A, l-1)$ for the prolongation, restriction and
refinement operators (inter-grid or vertical locality). In the following, we assume
that any p-node executes one process and that the np p-nodes have been ordered so
that two p-nodes close in the interconnection structure of the considered
architecture are close in the ordering as well. $P_h$ denotes the process executed
by the h-th p-node.

Our methodology defines a data mapping in three steps: i) determination of
the computational load of each square; ii) ordering of the squares; iii) order
preserving mapping of the squares onto the p-nodes. In the AMM the same load is
statically assigned to each square, because the number of operations is the same
for each point and does not change at run time. To preserve the locality properties
of the AMM, the squares are ordered through a space filling curve built starting
from the lowest grid of the hierarchy. After a square S in $g(A, l)$, the curve
visits any square in $g(S, d)$, $d > l$, before the next square in $g(A, l)$. The
recursive definition of the space filling curves preserves the vertical locality.
Moreover, if an appropriate curve is chosen, like the Peano-Hilbert or the Morton
one, the horizontal locality is partially preserved. A space filling mapping has
been adopted in BED] too. Since each square is paired with an hnode, any space
filling curve sf defines a visit $v(sf)$ of the H-Tree that returns an ordered
sequence $S(v(sf)) = [N_0, \ldots, N_{m-1}]$ of hnodes. To preserve the ordering
among squares, $S(v(sf))$ is mapped onto the ordered sequence of p-nodes through a
blocking strategy. $S(v(sf))$ is partitioned into np subsequences of consecutive
squares; the h-th subsequence includes m/np hnodes and it is assigned to $P_h$.
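The ordering and blocking steps can be sketched on a toy quad-tree. This is a
minimal illustration under stated assumptions: the class `HNode`, its fields, and
the functions `visit` and `block_mapping` are hypothetical names; children are
stored in Morton (Z) order; and every square carries the same unit load, as the
text states for the AMM.

```python
class HNode:
    """Hypothetical hnode: a square with corner (x, y), side `size`, at `level`."""

    def __init__(self, x, y, size, level):
        self.x, self.y, self.size, self.level = x, y, size, level
        self.children = []  # empty, or four sons stored in Morton (Z) order

    def refine(self):
        half = self.size / 2
        # Z-order of the four sons: lower-left, lower-right, upper-left, upper-right.
        self.children = [
            HNode(self.x + dx * half, self.y + dy * half, half, self.level + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        return self.children

def visit(node):
    """Pre-order visit: a square, then every square refining it, then the next one."""
    seq = [node]
    for child in node.children:
        seq.extend(visit(child))
    return seq

def block_mapping(seq, np_):
    """Split the ordered sequence into np_ consecutive blocks of nearly equal length."""
    m = len(seq)
    bounds = [(h * m) // np_ for h in range(np_ + 1)]
    return [seq[bounds[h]:bounds[h + 1]] for h in range(np_)]
```

Because each block is a contiguous slice of the visit order, the mapping keeps
squares that are close in the hierarchy on the same or on adjacent processes.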

The resulting mapping satisfies the range property: if the hnodes $N_i$ and
$N_{i+j}$ are assigned to $P_h$, then all the hnodes in between $N_i$ and $N_{i+j}$
in $S(v(sf))$ are assigned to $P_h$ as well. This property is fundamental to exploit
locality. The domain subset assigned to $P_h$, $Do_h$, includes squares at distinct
levels of the hierarchy. To avoid replicated computations, for each square in
$Do_h$, $P_h$ applies the operators of the V-cycle to the rightmost downward corner
only.

Our methodology assumes that the whole H-Tree cannot be fully replicated
in each p-node because of memory constraints. Hence, each p-node stores two
subtrees of the H-Tree: the replicated H-Tree and the private H-Tree. The private
H-Tree of $P_h$ includes all the hnodes representing squares in $Do_h$. Even if, in
general, the squares in $Do_h$ may correspond to disjoint subtrees of the H-Tree,
for the sake of simplicity we assume that $Do_h$ is represented by one connected
private H-Tree only. The replicated H-Tree represents the relation among the private
H-Trees and the H-Tree. It includes all the hnodes on the paths from the root
of the H-Tree to the roots of each private H-Tree, and it is the same for each
process. Each hnode N of the private H-Tree records all the data of the rightmost
downward corner of $sq(N)$, while each hnode N of the replicated H-Tree records
the position of $sq(N)$ in the domain and the identifier of the owner process.

To determine, during the computation, where the refinement operator has
to introduce finer grids, the processes estimate the current approximation er-
ror through the norm operator. This requires the exchange of the local errors
among all the processes at the end of a V-cycle and the computation of a global
error. Each process estimates its local error and the global error is computed
through the MPI_Allreduce primitive.

At the end of a V-cycle, to check whether the creation of finer grids has led to a
load unbalance, the processes exchange their workloads through the MPI_Allgather
primitive. Then, each process computes max_unbalance, the largest difference
between average_load, the ratio between the overall load and np, and the workload
of each process. If max_unbalance is larger than a tolerance threshold T > 0, then
each process executes the balancing procedure. T prevents the procedure from being
executed to correct a very low unbalance.

Let us suppose that the workload of $P_h$ is average_load + C, C > T, while that of
$P_k$, h < k, is average_load - C. The balancing procedure cannot map some of the
squares in $Do_h$ to $Do_k$ because this violates the range property. Instead, it
shifts the squares involving each process $P_i$ in between $P_h$ and $P_k$. Let us
define $Prec_i$ as the set of processes $[P_0 \ldots P_{i-1}]$ that precede $P_i$
and $Succ_i$ as the set of processes $[P_{i+1} \ldots P_{np-1}]$ that follow $P_i$.
Furthermore, $Sbil(Prec_i)$ and $Sbil(Succ_i)$ are, respectively, the global load
unbalances of the sets $Prec_i$ and $Succ_i$. If $Sbil(Prec_i) = C > T$, i.e. the
processes in $Prec_i$ are overloaded, $P_i$ receives from $P_{i-1}$ a segment S of
hnodes. If, instead, $Sbil(Prec_i) = C < -T$, $P_i$ sends to $P_{i-1}$ a segment S
of hnodes whose overall computational load is as close as possible to $|C|$. The
same procedure is applied to $Sbil(Succ_i)$ but, in this case, the hnodes are either
sent to or received from $P_{i+1}$. To respect the range property, if
$[N_q \ldots N_r]$ is the subsequence of hnodes it has been assigned, $P_i$ sends to
$P_{i-1}$ a segment $[N_q \ldots N_s]$, with q < s < r, while it sends to $P_{i+1}$
a segment $[N_t \ldots N_r]$, with q < t < r. $P_i$ communicates with processes
$P_{i-1}$ and $P_{i+1}$ only.

All communications exploit the synchronous mode with non-blocking send and receive
primitives. Non-blocking primitives overlap communication and computation,
while the choice of synchronous mode is due to the MPI implementation on the
considered parallel architecture, Cray T3E, that provides system buffering. If the
MPI standard mode is used, a deadlock may occur if a large number of pending
non-blocking operations has exhausted the system resources.

At the end of the load balancing procedure, all the processes exchange, through
MPI_Allgather and MPI_Allgatherv, the roots of their private H-Trees to update the
replicated H-Tree. Each process, using MPI_Allgather, declares to every other one
how much data it is going to send, i.e. how many roots it owns. Then, the
MPI_Allgatherv implements the exchange of the roots through a buffer allocated
according to the number of roots returned by the MPI_Allgather.
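The decision logic of the balancing procedure can be sketched without any message
passing, starting from the list of workloads that MPI_Allgather would return. The
sketch rests on two assumptions that the text does not spell out: max_unbalance is
taken as the largest absolute deviation from average_load, and Sbil(Prec_i) is
taken as the total load of the preceding processes minus their fair share. The
function name `balancing_plan` and the action labels are hypothetical. Only the
exchanges with the predecessor are shown; those with the successor are symmetric.

```python
def balancing_plan(loads, tolerance):
    """Return (max_unbalance, actions) for the gathered per-process workloads."""
    np_ = len(loads)
    average_load = sum(loads) / np_
    # Assumption: unbalance is measured as an absolute deviation from the average.
    max_unbalance = max(abs(w - average_load) for w in loads)
    actions = []
    if max_unbalance <= tolerance:
        return max_unbalance, actions  # unbalance too small to be worth correcting
    prefix = 0.0
    for i in range(1, np_):
        prefix += loads[i - 1]
        # Assumed definition of Sbil(Prec_i): load of P_0..P_{i-1} minus its fair share.
        sbil_prec = prefix - i * average_load
        if sbil_prec > tolerance:
            actions.append((i, "receive_from_prev", sbil_prec))
        elif sbil_prec < -tolerance:
            actions.append((i, "send_to_prev", -sbil_prec))
    return max_unbalance, actions
```

For workloads `[14, 10, 6]` and `tolerance=1`, the plan makes process 1 receive a
load of 4 from process 0 and process 2 receive a load of 4 from process 1, which
leaves every process with the average load of 10 while moving hnodes only between
neighbors in the ordering.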

4 Collecting Data from Other P-Nodes 

Each process $P_h$ applies the multigrid operators, in the order stated by the V-
cycle, to the points in $Do_h$. While in most cases any information that
$P_h$ needs is stored in the private H-Tree, for some points on the border of
$Do_h$, $P_h$ has to collect the values of points in squares assigned to other
processes. We outline the MPI implementation of our remote data collecting
procedure, denoted informed fault prevention, where processes exchange remote data
before applying the multigrid operators. This procedure allows $P_h$ to receive any
data it needs to apply the operator op, without requesting it from the owner
processes, before applying op. In this way, when $P_h$ applies op to $g(A, l)$, it
can visit the H-Tree in any order because it has already collected the data it
needs. The advantages of this technique are discussed in P] .

The informed fault prevention technique consists of two steps: the replicated
H-Tree extension step, executed at the beginning of each V-cycle, and the fault
prevention step, executed before each operator in the V-cycle. Let us define
$Bo_h(op, l)$ as the set of the squares $S_i$ in $Do_h$ at level l, such that one of
the neighbors of $S_i$, as defined by the neighborhood relation of op, does not
belong to $Do_h$. Furthermore, let $I_h(op, l)$ be the set of squares outside $Do_h$
corresponding to the points whose values are required by $P_h$ to apply op to the
points in the squares in $Bo_h(op, l)$.

In the replicated H-Tree extension step the processes exchange some information
about their private H-Trees. For each point $p_i$ in
$\bigcup_{op} \bigcup_{l} Bo_h(op, l)$ such that one of its neighbors belongs to
$Do_k$, $P_h$ sends to $P_k$ the level of the hnode N where $sq(N)$ is the smallest
square including $p_i$. This information is sent at the beginning of the V-cycle and
it is correct until the end of the V-cycle, when the refinement operator may add
finer grids. Since the refinement operator cannot remove a grid, if a load balancing
has not been executed, at the beginning of a V-cycle each process sends information
on the new grids only.

In a fault prevention step, $P_k$ determines $A_k I_h(op, l)$ $\forall h \neq k$,
i.e. the squares in $Do_k$ belonging to $I_h(op, l)$, by exploiting both the
information in the replicated H-Tree about $Do_h$ and the one received in the
replicated H-Tree extension step. Hence, $P_k$ sends to $P_h$, without any explicit
request, the values of the points in $A_k I_h(op, l)$. These values are exchanged
just before applying op to $g(Do_h, l)$, because they are updated by previous
operators in the V-cycle. Notice that $P_h$ can compute $I_h(op, l)$ by simply
merging the subsets $A_k I_h(op, l)$ received from its neighbors.

It is worth noticing that, in the case of the refinement operator, $A_k I_h(op, l)$
is approximated. In fact, whether $P_h$, that owns the square including the point
p, needs the value of the point q, in a square owned by $P_k$, depends not only
upon the neighborhood stencil but also upon the value of the points. Since $P_h$,
in the replicated H-Tree extension phase, sends to $P_k$ the depth of the hnodes
corresponding to the squares in $\bigcup_{op} \bigcup_{l} Bo_h(op, l)$, but not the
values of the points in these squares, $P_k$ cannot determine $A_k I_h(op, l)$
exactly. To guarantee that $P_h$ receives all the data it needs, $P_k$ determines
the squares to be sent according to the neighborhood stencil only, and it could send
some useless values.

Both steps are implemented through MPI point to point communications.
Collective communications, e.g. MPI_Scatter, have not been adopted, because
each process usually communicates with a few other processes. Adopting them would
imply the creation, for each process $P_h$, of one communicator $C_h$ including any
neighbor of $P_h$. To this aim, $P_h$ should determine the set of the neighbors of
each process in MPI_COMM_WORLD, but it does not have enough information to do so.
Moreover, MPI_Comm_split cannot be exploited because the communicators associated
with two neighboring processes are not disjoint. Furthermore, at the end of a
V-cycle, because of the refinement operator and of the load balancing procedure, the
neighbors of $P_h$ change; this requires the elimination of old communicators and
the creation of new ones. Also notice that, since the collective communications
are blocking, they have to be properly reordered to prevent deadlock.

In order to overlap communication with useful computation, in the fault
prevention procedure each process determines the data to be sent to other pro-
cesses while it is waiting for the data from its neighbors. Moreover, data to be
sent to the same process are merged into one message, to reduce the number of
communications and the setup overhead. This is a noticeable advantage of in-
formed fault prevention and it is implemented as follows: each process $P_k$ issues
an MPI_Irecv from MPI_ANY_SOURCE to declare that it is ready to receive the
sets $A_h I_k(op, l)$ from any $P_h$. While waiting for these data, $P_k$ determines
the data to be sent to all the other processes, i.e. $\forall h \neq k$ it computes
$A_k I_h(op, l)$. When a predefined amount of data to be sent to the same process
has been determined, $P_k$ sends it using an MPI_Issend. Subsequently, $P_k$ checks
through an MPI_Test the status of the pending MPI_Irecv. If the communication has
been completed, $P_k$ inserts the received data in its replicated H-Tree and it
posts another MPI_Irecv. In any case, the computation of the data to be sent goes
on. This procedure is iterated until no more data has to be exchanged. After
sending $A_k I_h(op, l)$ for any h, $P_k$ sends, through np-1 MPI_Isend, a
synchronization message to every other process and it continues to receive data
from them. Since $P_k$ does not know how much data it will receive, it waits for the
synchronization message from all the other processes. Then, $P_k$ begins to apply
op to $Do_k$. An MPI barrier has not been used to synchronize the processes because
it is a blocking primitive; hence, after issuing an MPI_Barrier, $P_k$ cannot
collect data from other processes. A data exchange between a pair of processes
involves variables with distinct datatypes. In order to merge these values into one
message, we have compared the adoption of MPI_Pack/MPI_Unpack against that of
derived datatypes; both techniques achieve similar execution times.
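The merging of data bound for the same process can be illustrated independently of
MPI. The sketch below only models the batching rule: values are accumulated per
destination and a message is emitted once a predefined amount has been collected.
The function name `merge_messages`, the `batch_size` parameter and the final flush
in destination order are assumptions for illustration; in the procedure described
above each emitted message would correspond to one non-blocking synchronous send.

```python
from collections import defaultdict

def merge_messages(items, batch_size):
    """Group (destination, value) pairs into per-destination messages.

    Returns the messages in the order in which they would be sent.
    """
    pending = defaultdict(list)
    messages = []
    for dest, value in items:
        pending[dest].append(value)
        # Send as soon as a predefined amount of data for this destination is ready.
        if len(pending[dest]) >= batch_size:
            messages.append((dest, pending.pop(dest)))
    # Flush what is left once all the data to be sent has been determined.
    for dest in sorted(pending):
        messages.append((dest, pending[dest]))
    return messages
```

A larger `batch_size` means fewer messages and less setup overhead, at the price
of delaying the moment at which the receiver can start using the data.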

5 Experimental Results 

We present some experimental results of the MPI parallel version of the AMM.
The parallel architecture we consider is a Cray T3E; each p-node includes a DEC
Alpha EV5 processor and 128 MB of memory. The interconnection network is a
torus. MPI primitives are embedded in the C language.

We consider the Poisson problem on the unit square in two dimensions, i.e.
the Laplace equation subject to the Dirichlet boundary conditions:

    $-\Delta u = f(x, y)$ in $\Omega = ]0,1[ \times ]0,1[$

    $u = h(x, y)$ on $\partial\Omega$

with f(x, y) = 0 and two boundary conditions:

    (i) h(x, y) = 10    (ii) h(x, y) = 10 cos(2π(x - + ^))

Fig. 1. Load balance (x-axis: tolerance threshold)

The solution of the Poisson problem is simpler than those of other equations
such as the Navier-Stokes one. Hence, the ratio between computational work and
parallel overhead is low and this is a significant test for a parallel implementation.
The distribution of the points in the domain in the case of boundary condition (ii)
is more irregular than that of (i). In fact, given the same maximum H-Tree
depth, the final number of hnodes of H-Tree (i) is three times that of H-Tree (ii).

In order to evaluate the effectiveness of the informed fault prevention tech-
nique, we have measured that, for both conditions, the data sent in the
informed fault prevention are less than 104% of the data required. As previ-
ously explained, due to the refinement operator, this percentage cannot be equal
to 100%, but the amount of useless data is less than 4%.

Fig. 1 shows the execution time for different values (in percentage) of the tol-
erance threshold T. The balancing procedure considerably reduces the execution
time; in fact, in the worst case, the execution time of an unbalanced computation
may be 25% higher than the optimal one. However, if T is less than the opti-
mal value, no benefit is achieved, because the cost of the balancing procedure is
larger than the unbalance recovered. Fig. 1 also shows that the optimal value of
T depends upon the distribution of the points in the domain. In fact, the same
value of T results in very different execution times for the two conditions; also,
the lowest execution times have been achieved using distinct values of T for the
two equations.

Figure 2 shows the efficiency of the AMM for the two problems, for a fixed
initial grid with k = 7 (see Sect. 2), the same maximum grid level, 12, and a
variable number of p-nodes. The low efficiency resulting in the second problem
is due to a highly irregular grid hierarchy. However, even in the worst case, our
solution achieves an efficiency larger than 50% on 16 p-nodes.

Fig. 2. Efficiency for problems with fixed data dimension 

Abstract. The unconstrained global programming problem is addressed 
using an efficient multi-start algorithm, in which parallel local searches 
contribute towards a Bayesian global stopping criterion. 

The stopping criterion, denoted the unified Bayesian global stopping 
criterion, is based on the mild assumption that the probability of 
convergence to the global optimum $x^*$ is comparable to the probability of 
convergence to any local minimum $\tilde{x}_j$. 

The combination of the simple multi-start local search strategy and the 
unified Bayesian global stopping criterion outperforms a number of lead- 
ing global optimization algorithms, for both serial and parallel implemen- 
tations. Results for parallel clusters of up to 128 machines are presented. 



1 Introduction 

Consider the unconstrained (or bounds constrained) mathematical programming 
problem represented by the following: Given a real valued objective function $f(x)$ 
defined on the set $x \in D$ in $\mathbb{R}^n$, find the point $x^*$ and the 
corresponding function value $f^*$ such that 

$$f^* = f(x^*) = \min\{ f(x) \mid x \in D \} \qquad (1)$$

if $x^*$ exists and is unique. Alternatively, find a low approximation $\tilde{f}$ to $f^*$. 

If the objective function and/or the feasible domain D are non-convex, then 
there may be many local minima which are not optimal. Hence, from a mathematical 
point of view, Problem (1) is essentially unsolvable, due to a lack of 
mathematical conditions characterizing the global optimum, as opposed to a strictly 
convex continuous function, which is characterized by the Karush-Kuhn-Tucker 
conditions at the minimum. 

Optimization algorithms aimed at solving Problem (1) are divided into two 
classes, namely deterministic and stochastic. The first class comprises those 
algorithms which implicitly search all of the function domain and thus are 
guaranteed to find the global optimum. The algorithms within this class are forced 
to deal with restricted classes of functions (e.g. Lipschitz continuous functions 
with known Lipschitz constants). Even with these restrictions it is often 
computationally infeasible to apply deterministic algorithms to search for the 
guaranteed global optimum, as the number of computations required increases 
exponentially with the dimension of the feasible space. To overcome the inherent 
difficulties of the guaranteed-accuracy algorithms, much research effort has been 
devoted to algorithms in which a stochastic element is introduced; in this way the 
deterministic guarantee is relaxed into a confidence measure. A number of 
successful algorithms belong to the latter class. 

A general stochastic algorithm for global optimization consists of three major 
steps: a sampling step, an optimization step, and a check of some global stopping 
criterion. The availability of a suitable global stopping criterion is probably 
the most important aspect of global optimization. It is also the most problematic, 
due to the very fact that characterization of the global optimum is in general 
not possible. 

Global optimization algorithms and their associated global stopping criteria 
should ultimately be judged on performance. However, when evaluating global 
optimization algorithms, a priori known information about the objective 
function under consideration should not be used. For example, the 
termination of algorithms once the known global optimum has been attained 
within a prescribed tolerance complicates the use of these algorithms, and makes 
comparisons with other algorithms very difficult. 

In this paper a number of very simple heuristic algorithms based on multiple 
local searches are constructed. A Bayesian stopping condition is presented, based 
on a criterion previously presented by Snyman and Fatti for their algorithm 
based on dynamic search trajectories [2]. The criterion is shown to be quite 
general, and can be applied in combination with any multi-start global search 
strategy. Since the local searches are independent of each other, they are ideally 
suited for implementation on a massively parallel processing machine. 



2 A Global Stopping Criterion 

It is required to calculate $\tilde{f}$, i.e. 

$$\tilde{f} = \min\{ \tilde{f}^j \text{ over all } j \text{ to date} \} \qquad (2)$$

as the approximation to the global minimum $f^*$. In finding $\tilde{f}$, over-sampling 
of $f$ should be prevented as far as possible. In addition, an indication of the 
probability of convergence to $f^*$ is desirable. A Bayesian argument seems to 
us the proper framework for the formulation of such a criterion. Previously, 
two such criteria have been presented, respectively by Boender and Rinnooy Kan 
[?], and Snyman and Fatti [2]. 

The former criterion, denoted the optimal sequential Bayesian stopping rule, 
is based on an estimate of the number of local minima in D and the relative size 
of the region of attraction of each local minimum. While apparently effective, 
computational expense prohibits using this rule for functions with a large number 
of local minima in D. 

The latter criterion is not dependent on an estimate of the number of local 
minima in D or the regions of attraction of the different local minima. Instead, a 
simple assumption about the probability of convergence to the global optimum 
$x^*$ in relation to the probability of convergence to any local minimum 
$\tilde{x}_j$ is made. In addition, the probability of convergence to $f^*$ can be 
calculated. 

The rule presented by Snyman and Fatti is derived specifically for their dy- 
namic search method, but is in all probability of greater importance and more 
generally applicable than hitherto realized. In the following, we will show that 
this rule can be used as a general stopping criterion in multi-start algorithms, 
albeit for a restricted class of functions. In doing so, we do not consider the 
region of attraction $R_k$ of local minimum $k$. Instead, for a given starting 
point, we simply refer to the probability of convergence $\alpha_k$ to local 
minimum $k$.¹ Henceforth, we will denote the rule of Snyman and Fatti the unified 
Bayesian stopping rule. 



2.1 The Unified Bayesian Stopping Rule 

Let $\alpha_k$ denote the probability that a random starting point will converge 
to local minimum $\tilde{x}^k$. Also, the probability of convergence to the global 
minimum $x^*$ is denoted $\alpha^*$. The following mild assumption, which is 
probably true for many functions of practical interest, is now made: 

$$\alpha^* \ge \alpha_k \text{ for all local minima } \tilde{x}^k. \qquad (3)$$

Furthermore, let $r$ be the number of starting points from which convergence to 
the current best minimum $\tilde{f}$ occurs after $\tilde{n}$ random searches have 
been started. Then, under assumption (3), the probability that $\tilde{f}$ is equal 
to $f^*$ is given by 

$$\Pr[\tilde{f} = f^*] \ge q(\tilde{n}, r) = 1 - \frac{(\tilde{n}+\bar{a})!\,(2\tilde{n}+\bar{b})!}{(2\tilde{n}+\bar{a})!\,(\tilde{n}+\bar{b})!}, \qquad (4)$$

with $\bar{a} = a + b - 1$, $\bar{b} = b - r - 1$, and $a$, $b$ suitable parameters 
of the Beta distribution $\beta(a, b)$. On the basis of (4) the adopted stopping 
rule becomes²: 

$$\text{STOP when } \Pr[\tilde{f} = f^*] \ge q^*, \qquad (5)$$

where $q^*$ is some prescribed desired confidence level, typically chosen between 
0.99 and 0.999. 
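
As a numerical illustration of (4) and (5), the following C sketch evaluates q(n, r) as a product of ratios instead of through explicit factorials. The names `bayes_confidence` and `should_stop` are illustrative only, and restricting a and b to integers, with a = b = 1 in the stopping test, is an assumption of this sketch (that choice reproduces, for instance, the value 0.9929 listed for r/n = 6/76 in Table 4).

```c
#include <assert.h>

/* Lower bound q(n, r) of eq. (4) on the probability that the current
 * best minimum is the global one.
 *   n    : number of random searches started so far
 *   r    : number of searches that converged to the current best minimum
 *   a, b : Beta distribution parameters (integers in this sketch)
 * With abar = a + b - 1 and bbar = b - r - 1, the factorial ratio
 * (n+abar)!(2n+bbar)! / ((2n+abar)!(n+bbar)!) reduces to a product of
 * abar - bbar = a + r terms, which avoids overflowing factorials. */
double bayes_confidence(int n, int r, int a, int b)
{
    assert(n >= 1 && r >= 1 && r <= n && a >= 1 && b >= 1);
    int bbar = b - r - 1;
    int terms = a + r;
    double ratio = 1.0;
    for (int i = 1; i <= terms; i++)
        ratio *= (double)(n + bbar + i) / (double)(2 * n + bbar + i);
    return 1.0 - ratio;
}

/* Stopping rule (5): stop once q(n, r) reaches the target q_star.
 * a = b = 1 is assumed here. */
int should_stop(int n, int r, double q_star)
{
    return bayes_confidence(n, r, 1, 1) >= q_star;
}
```

Under these assumptions, four hits out of four starts give q = 125/126, roughly 0.9921, so `should_stop(4, 4, 0.99)` is true while `should_stop(3, 3, 0.99)` is not.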

¹ Studying simple 1-D search trajectories, we observe that the definition of region 
of attraction of a local minimum is problematic. Strictly speaking, the region of 
attraction can only be defined when non-discrete search trajectories (line search or 
other) are employed. 

² For the sake of brevity, we refrain from presenting a proof for (4) here. The proof 
is similar to that presented in [2]. However, we express our proof in terms of the 
probability of convergence to a local minimum, and not in terms of the region of 
attraction of the local minimum. Furthermore, no implicit assumption regarding a 
prior distribution is made. 

Table 1. The extended Dixon-Szego test set 

No.  Acronym  Name                  No.  Acronym  Name
1    G1       Griewank G1           7    BR       Branin
2    G2       Griewank G2           8    H3       Hartman 3
3    GP       Goldstein-Price       9    H6       Hartman 6
4    C6       Six-hump camelback    10   S5       Shekel 5
5    SH       Shubert, Levi No. 4   11   S7       Shekel 7
6    RA       Rastrigin             12   S10      Shekel 10

3 A Simple Global Search Heuristic 

In all probability, the simplest global optimization algorithm is the combination 
of multiple local searches with some probabilistic stopping criterion. 
Here, we present such a formulation, and utilize (5). We also provide for a global 
minimization step. Various sequential algorithms may be constructed using the 
following framework: 

1. Initialization: Set the trajectory counter $j := 1$, and prescribe the desired 
confidence level $q^*$. 

2. Sampling steps: Randomly generate $x_0^j \in D$ in $\mathbb{R}^n$. 

3. Global minimization steps: Starting at $x_0^j$, attempt to minimize $f$ in a 
global sense by some preliminary search procedure, viz. find and record some 
low point $\bar{x}^j$ with function value $\bar{f}^j$. 

4. Local minimization steps: $\bar{x}^j$ is used as the starting point for a robust 
gradient based convex minimization algorithm, with stopping criteria defined 
in terms of the Karush-Kuhn-Tucker conditions. Record the lowest function 
value $\tilde{f}^j$ and the corresponding point $\tilde{x}^j$. 

5. Global termination: Assess the global convergence after $j$ searches to date 
(yielding $\tilde{x}^k$, $k = 1, 2, \ldots, j$) using (4). If (5) is satisfied, STOP, 
else, $j := j + 1$ and goto 2. 
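
The framework above can be sketched in C for a one-dimensional toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the objective f(x) = x^4 - 3x^2 + x on D = [-2, 2], the fixed-step gradient descent that stands in for the local minimizer of Step 4, the golden-ratio sequence that stands in for random sampling (so that the run is reproducible), the tolerance used to decide that two searches reached the same minimum, and a = b = 1 in (4). Step 3 is omitted, which corresponds to pure multiple local searches.

```c
#include <assert.h>

/* Toy objective (illustrative): global minimum near x = -1.30,
 * local minimum near x = 1.13 on D = [-2, 2]. */
static double f(double x)  { return x * x * x * x - 3.0 * x * x + x; }
static double df(double x) { return 4.0 * x * x * x - 6.0 * x + 1.0; }

/* Step 4 stand-in: fixed-step gradient descent, clipped to D. */
static double local_search(double x)
{
    for (int it = 0; it < 5000; it++) {
        x -= 0.01 * df(x);
        if (x < -2.0) x = -2.0;
        if (x >  2.0) x =  2.0;
    }
    return x;
}

/* q(n, r) of eq. (4) with a = b = 1 (assumed), written as a product. */
static double confidence(int n, int r)
{
    double ratio = 1.0;
    for (int i = 1; i <= r + 1; i++)
        ratio *= (double)(n - r + i) / (double)(2 * n - r + i);
    return 1.0 - ratio;
}

/* Steps 1, 2, 4 and 5 of the framework. Returns the best function value
 * found and stores the number of searches used in *n_used. */
double multistart(double q_star, int max_starts, int *n_used)
{
    const double golden = 0.6180339887498949;
    double best = 0.0, u = 0.0;
    int r = 0, n = 0;
    assert(max_starts >= 1 && n_used != 0);
    while (n < max_starts) {
        /* Step 2: deterministic stand-in for random sampling in D */
        u += golden;
        if (u >= 1.0) u -= 1.0;
        double x = local_search(-2.0 + 4.0 * u);   /* Step 4 */
        double fx = f(x);
        n++;
        if (n == 1 || fx < best - 1e-6) { best = fx; r = 1; } /* new best */
        else if (fx - best <= 1e-6)     { r++; }              /* same minimum */
        /* Step 5: global termination test, rule (5) */
        if (confidence(n, r) >= q_star) break;
    }
    *n_used = n;
    return best;
}
```

A call such as `multistart(0.99, 200, &n)` returns a value close to -3.51, the global minimum of the toy function. Because the searches are independent, the body of the loop is the part that the parallel version distributes over the slaves.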

Pure multiple local searches are obtained if Step 3 is excluded, with 
$\bar{x}^j = x_0^j$. We now construct 2 such simple algorithms, namely 

1. LLS1: multiple local searches using the bound-constrained BFGS algorithm 
[?], and 

2. LLS2: multiple local searches using the unconstrained Polak-Ribiere 
algorithm [?]. 

In addition, for both LLS1 and LLS2 we add a global minimization phase (step 
3), and denote the respective algorithms GLS1 and GLS2. In the global phase 
we simulate the trajectories of a bouncing ball (the MBB algorithm, [?]), which 
is attractive due to its simplicity. The ball's elasticity coefficient is chosen such 
that the ball's energy is dissipated very quickly. 

Table 2. Number of failures of convergence to the global optimum for 100 
(random) restarts of each algorithm for the complete test set. For the problems 
not listed, the number of failures is 0 for all tabulated values of the prescribed 
confidence q*. (Fewer than 3 failures at q* = 0.95, combined with none at higher 
values of q*, are not reported.) 

Algorithm  Function  Number of failures
                     q* = .95  q* = .99  q* = .999  q* = .9999
GLS1       G1        27        18        6          5
           G2        21        11        4          3
           RA        20        18        6          2
LLS1       G1        39        17        8          4
           G2        12        7         3          2
           RA        54        33        15         4
GLS2       G1        16        12        1          0
           RA        12        8         7          4
LLS2       G1        22        18        7          4
           RA        15        12        7          2
SF [2]     G1        6         2         1          1
           G2        52        29        12         12
           SH        54        43        20         18
           RA        38        18        6          6

4 Parallel Implementation 

The search trajectories generated in our algorithms are completely independent 
of each other. Hence the sequential algorithm presented in section 3 may easily 
be parallelized. To this end, we utilize the freely available pvm3 [?] code for 
FORTRAN, running under the Linux operating system. Currently, the massive 
parallel processing virtual machine (MPPVM) consists of up to 128 Pentium III 
450 MHz machines in an existing undergraduate computer lab. 

The distributed computing model represents a master-slave configuration 
where the master program assigns tasks and interprets results, while the slaves 
compute the search trajectories. The workload is statically assigned, and no 
inter-slave communication occurs. The master program informs each slave task of 
the optimization problem parameters by a single broadcast and awaits individual 
results from each slave. 

4.1 A Measure of Computational Effort 

We will assume that our algorithm will ultimately be used in problems for which 
the CPU requirement of evaluating $f$ is orders of magnitude larger than the 
time required for message passing and algorithm internals. (In structural 
optimization, for example, each function evaluation typically involves a complete 
finite element or boundary element analysis.) Hence we define a somewhat 
unconventional measure for the cost of our parallelized algorithm which we denote 
apparent visible cost ($N_{vc}$). This cost represents the number of function 
evaluations associated with the random starting point $x_0^j$ which results in the 
most expensive search trajectory. The time window (in CPU seconds) associated with 
this search trajectory is denoted the virtual CPU time. The virtual CPU time 
includes the time window associated with initialization and evaluation of stopping 
criterion (5). 

Table 3. Comparison with some other algorithms. For the problems listed, the 
number of function evaluations $N_{fe}$ for the different algorithms is reported 

Problem  GLS1   LLS1    GLS2   LLS2   SF [2]  [?]     [?]      [?]
G1       2644   10678   2992   7215   5063    1822    396147   3623
G2       1882   1675    2398   1510   86672   10786   828441   16121
GP       454    229     403    471    2069    6775    94587    7450
C6       238    108     275    225    602     579     76293    3711
SH       1715   1626    1363   1485   93204   1443    139087   3788
RA       2893   1487    3119   3161   45273   3420    445711   2051
BR       211    240     552    724    9553    594     71688    4769
H3       289    199     478    462    1695    915     103466   1698
H6       346    266     521    588    3550    3516    106812   9933
S5       315    479     353    607    6563    1772    234654   1915
S7       273    473     417    555    1848    1923    212299   4235
S10      382    508     449    564    1604    2631    330486   4226

5 Numerical Results 

The algorithms are tested using an extended Dixon-Szego test set, presented in 
Table 1. The 12 well known functions used are given in, for instance, [?]. 

Firstly, Table 2 shows the effect of the prescribed confidence level $q^*$ in 
stopping criterion (5). The decreasing number of failures of convergence to $f^*$ 
as $q^*$ increases illustrates the general applicability of the unified Bayesian 
global stopping rule. All of the new algorithms outperform the SF algorithm, for 
which the stopping criterion was originally derived. 

Table 3 reveals that the simple sequential algorithms presented herein compare 
very favorably with a number of leading contenders, namely the Snyman-Fatti 
algorithm [2], clustering [?], algorithm 'sigma' [?] and the algorithm 
presented by Mockus [?]. All the algorithms were started from different random 
starting points, and the reported cost is the average of 10 independent runs. In 
particular, the results for two very difficult test functions, namely Griewank G1 
and Griewank G2 [?], are encouraging: few algorithms find the solution to G2 
(which has a few thousand local minima in the region of interest) in less than 
some 20000 function evaluations. 

Table 4. Apparent visual cost $N_{vc}$ for a 32-node parallel virtual machine and 
a 128-node parallel virtual machine. $N_{vc}$ may be compared with the number 
of function evaluations $N_{fe}$ of the sequential GLS1 algorithm. $r$ represents the 
number of starting points from which convergence to the current best minimum 
$\tilde{f}$ occurs after $\tilde{n}$ random searches have been started. The 
probability that $\tilde{f}$ is equal to $f^*$ is given by $q(\tilde{n}, r)$ 

Prob.  GLS1 (sequential)       32-node pvm             128-node pvm
       N_fe  r/n    q(n,r)     N_vc  r/n     q(n,r)    N_vc  r/n      q(n,r)
G1     1599  6/76   0.9929     90    6/96    0.9929    30    7/128    0.9965
G2     2122  6/50   0.9933     189   6/96    0.9928    74    7/128    0.9965
GP     341   5/12   0.9903     40    18/32   1.0000    39    59/128   1.0000
C6     163   5/9    0.9923     22    19/32   1.0000    22    75/128   1.0000
SH     1290  6/49   0.9933     89    9/64    0.9993    50    17/128   1.0000
RA     817   6/41   0.9935     96    8/128   0.9982    26    9/128    0.9992
BR     107   4/4    0.9921     78    31/32   1.0000    76    120/128  1.0000
H3     207   5/8    0.9932     32    18/32   1.0000    32    77/128   1.0000
H6     288   5/8    0.9932     60    21/32   1.0000    59    79/128   1.0000
S5     132   5/8    0.9932     22    6/32    0.9939    52    52/128   1.0000
S7     293   6/17   0.9953     25    14/32   1.0000    37    56/128   1.0000
S10    336   6/17   0.9953     32    11/32   0.9999    39    48/128   1.0000

Finally, Table 4 reveals the effect of parallel implementation. For relatively 
'simple' problems (viz. problems with few design variables or few local minima in 
the design space), the probability of convergence to the global optimum becomes 
very high when the number of nodes is increased. This is illustrated by, for 
example, the results for the C6 problem. For more difficult problems (e.g. the 
G1 and G2 problems), the probability of convergence to the global optimum $f^*$ 
is increased. 

Simultaneously, the total computational time (as compared to the sequential 
GLS1 algorithm) decreases notably. For the 32-node parallel virtual machine, 
the virtual CPU time to evaluate all the test functions on average decreases by a 
factor of 1.93 (not shown in tabulated form). The time associated with message 
passing is negligible compared to the time associated with the global searches. 

When the time associated with a single function evaluation becomes much 
larger than the time required for algorithm internals, the fraction 
$N_{fe}/N_{vc}$ based on Table 4 may be used as a direct indication of the decrease 
in virtual computational time obtainable as a result of parallelization. For the 
G2 problem, this would imply a reduction in computational time by a factor of 
28.68 for the 128-node parallel virtual machine. 
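
As a worked instance of this ratio, the G2 row of Table 4 lists 2122 function evaluations for the sequential GLS1 run and apparent costs of 189 and 74 for the 32-node and 128-node machines. The 32-node figure below is computed here from those tabulated values and is not quoted in the text.

```latex
\left.\frac{N_{fe}}{N_{vc}}\right|_{128\ \text{nodes}} = \frac{2122}{74} \approx 28.68,
\qquad
\left.\frac{N_{fe}}{N_{vc}}\right|_{32\ \text{nodes}} = \frac{2122}{189} \approx 11.23
```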

6 Conclusions 

We have presented a number of efficient multi-start algorithms for the uncon- 
strained global programming problem, based on simple local searches. A salient 
point is the availability of a suitable stopping criterion, which we denote the 
unified Bayesian global stopping criterion. 

Parallelization is shown to be an effective method to reduce the computa- 
tional time associated with the solution of expensive global programming prob- 
lems. While the apparent computational effort is reduced, the probability of 
convergence to the global optimum is simultaneously increased. 

Abstract. Several generalizations of the flat data parallel model have been 
proposed. Their aim is to allow nested parallel invocations, combining the ease 
of programming of the data parallel model with the efficiency of the control 
parallel model. We examine the solutions provided to this issue by two standard 
parallel programming platforms, OpenMP and MPI. Both their expression capacity 
and their efficiency are compared on a Sun HPC 3500 and an SGI Origin 2000. The 
two considered architectures are shared memory and, consequently, more suitable 
for exploitation under OpenMP. In spite of this, the results show that, using the 
methodology proposed for MPI in this paper, not only is the performance of the 
two platforms similar but, more remarkably, the effort invested in software 
development is also the same. 



1. Introduction 

Data parallelism is one of the more successful efforts to introduce explicit parallelism 
to high level programming languages. The approach is taken because many useful 
computations can be framed in terms of a set of independent sub-computations, each 
strongly associated with an element of a large data structure. Such computations are 
inherently parallelizable. Data parallel programming is particularly convenient for 
two reasons. The first is its ease of programming. The second is that it can scale 
easily to larger problem sizes. Several data parallel language implementations are 
available now [2]. However, almost all discussion of data parallelism has been limited 
to the simplest and least expressive form: unstructured (flat) data parallelism. Several 
generalizations of the data parallel model have been proposed which permit the 
nesting of data parallel constructors to specify parallel computation across nested and 
irregular data structures [1]. These extensions include the capability of nested parallel 
invocations, combining the ease of programming of the data parallel model with the 
efficiency of the control parallel model in the execution on irregular data structures. 
We examine the solutions provided by two standard parallel programming platforms, 
OpenMP and MPI, comparing their expression capacity and their efficiency on two 
shared memory architectures. The first is a Sun HPC 3500 UltraSPARC II based 
system with 8 processors and 8 Gbytes of shared memory. The second platform 
considered is a Silicon Graphics Origin 2000, with 64 MIPS R10000 processors and 
8 Gbytes of main memory. The two architectures are shared memory and 
consequently more suitable for exploitation under the shared-variable 
programming model. Despite this, the conclusion of this work is that, with 
an appropriate methodology, not only is the performance of the two platforms 
comparable but, more remarkably, the effort invested in software development is also 
the same. 

From the wide range of applications that benefit from nested parallelism, we 
have chosen the Divide and Conquer technique since it provides an excellent scenario 
for benchmarking. Both the general technique and the particular case that is 
considered throughout the paper are introduced in section 2. The two following sections 
describe in detail the expression of a Nested Parallel Fast Fourier Transform, 
exploiting both data and code parallelism, in MPI and OpenMP. The fifth section 
presents the computational results on the two machines. From these results 
and the comparative study of the codes we draw the conclusions in section 6. 



2. Divide and Conquer as a Test Bed for Nested Parallelism 

Let us consider the special case of the divide and conquer approach presented in Fig. 
1, where both the solutions r and the problems x have a vectorial nature. In such a case 
there are opportunities to exploit parallelism not only at the task level (line 7) but also 
in the divide and combine subroutines (lines 6 and 8). Thus, data parallelism can be 
introduced by having every processor in the current group work on a subsection of 
the array x in the division phase (respectively, a subsection of r in the combination 
phase). 



 1 procedure pDC(x: problem; r: solution); 
 2 begin 
 3   if trivial(x) then conquer(x, r) 
 4   else 
 5   begin 
 6     divide(x, x0, x1); 
 7     parallel do (pDC(x0, r0) || pDC(x1, r1)); 
 8     combine(r, r0, r1); 
 9   end; 
10 end; 



Fig. 1. General frame for a parallel divide and conquer algorithm 

As benchmark instance for this paper we will consider the Fast Fourier Transform 
(FFT) algorithm. However, the proposed techniques have been applied to other divide 
and conquer algorithms with similar results. Consider a sequence of complex numbers 
$a = (a[0], \ldots, a[N-1])$ of length $N$. The Discrete Fourier Transform (DFT) of the 
sequence $a$ is the sequence $A = (A[0], \ldots, A[N-1])$ given by 

$$A[i] = \sum_{k=0}^{N-1} a[k]\, w^{ki},$$

where $w = e^{2\pi\sqrt{-1}/N}$ is the primitive $N$-th root of unity in the complex 
plane. The following decomposition can be deduced from the definition: 

$$A[i] = \sum_{k=0}^{N/2-1} a[2k]\, w^{2ki} + w^{i} \sum_{k=0}^{N/2-1} a[2k+1]\, w^{2ki}$$

From this formula, it follows that the DFT A of a can be obtained by combining the 
DFT B of the even components and the DFT C of the odd components of a. 
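
The decomposition can be made concrete with a short sequential C sketch. It is an illustration under stated assumptions rather than the seqFFT routine used by the authors: the `Complex` layout, the name `seq_dft`, and the convention that the caller supplies a table W with W[k] = w^k for the full-length transform are all choices made here. The recursion reads the even components at stride 2 from the start of the input and the odd components at stride 2 from the second element, then combines B and C using the fact that w^(i+N/2) = -w^i.

```c
#include <assert.h>

typedef struct { double re, im; } Complex;

/* Recursive radix-2 DFT following the even/odd decomposition.
 *   A      : output, N elements
 *   a      : input, read at positions 0, stride, 2*stride, ...
 *   W      : table with W[k] = w^k for the full-length transform
 *   N      : current subproblem size, a power of 2
 *   stride : distance between consecutive inputs (1 at the top level)
 *   D      : scratch buffer of N elements, distinct from A */
void seq_dft(Complex *A, const Complex *a, const Complex *W,
             unsigned N, unsigned stride, Complex *D)
{
    assert(N >= 1 && A != D);
    if (N == 1) { A[0] = a[0]; return; }
    unsigned n = N >> 1;
    Complex *B = D, *C = D + n;
    seq_dft(B, a,          W, n, stride << 1, A);      /* even components */
    seq_dft(C, a + stride, W, n, stride << 1, A + n);  /* odd components  */
    for (unsigned i = 0; i < n; i++) {
        const Complex *pW = W + i * stride;            /* w^i at this level */
        Complex aux;
        aux.re = pW->re * C[i].re - pW->im * C[i].im;
        aux.im = pW->re * C[i].im + pW->im * C[i].re;
        A[i].re     = B[i].re + aux.re;
        A[i].im     = B[i].im + aux.im;
        A[i + n].re = B[i].re - aux.re;                /* w^(i+n) = -w^i */
        A[i + n].im = B[i].im - aux.im;
    }
}
```

For N = 4 with W = (1, i, -1, -i) and input (1, 2, 3, 4), the routine yields (10, -2-2i, -2, -2+2i), which matches a direct evaluation of the defining sum.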



3. Nested Parallelism in MPI 

The code in Fig. 2 shows a nested implementation of the DFT using MPI [3]. The 
algorithm assumes that the input vector a is replicated onto the initial set of 
processors, while the resulting DFT A is delivered block distributed. For simplicity, 
let us also assume that the number of elements N is larger than the number of processors 
p in the initial set, and that both N and p are powers of 2. Parameter Np holds the 
quotient N/p, W is the vector containing the powers of the primitive N-th root of 
unity, and vector D is used as a temporary buffer during the combination. 



 1 void parDandCFFT(Complex *A, Complex *a, Complex *W, unsigned Np, 
 2                  unsigned stride, Complex *D) { 
 3   Complex Aux, *pW; 
 4   unsigned i, size; 
 5   if (NUMPROCESSORS > 1) { 
 6     /* Division phase */ 
 7     size = Np*sizeof(Complex); 
 8     /* Subproblems resolution phase */ 
 9     PAR(parDandCFFT(A, a, W, Np, stride<<1, D), A, size, 
10         parDandCFFT(D, a+stride, W, Np, stride<<1, A), D, size); 
11     /* Combination phase */ 
12     for(i = 0, pW = W + (Np*NAME*stride); i < Np; i++, pW += stride) { 
13       Aux.re = pW->re * D[i].re - pW->im * D[i].im; 
14       Aux.im = pW->re * D[i].im + pW->im * D[i].re; 
15       A[i].re += Aux.re; 
16       A[i].im += Aux.im; 
17     } 
18   } 
19   else 
20     seqFFT(A, a, W, N, stride, D); 
21 } 



Fig. 2. Nested parallel DFT implementation using MPI 

The key point in this code is the use of the macro PAR in line 9. The call to macro 
PAR(f1, p1, s1, f2, p2, s2) is expanded to the code shown in Fig. 3, dividing the current 
group of processors into two subgroups. Each processor is assigned to a subgroup, and 
the replicated variable NUMPROCESSORS holds the number of processors in its 
group. While the first subgroup executes function f1 (line 7 in Fig. 3), the second one 
does the same with function f2 (line 14 in Fig. 3). After that, the two subgroups 
exchange the results of their computations (lines 8 and 15 in Fig. 3), which 
consist of s_i bytes pointed to by p_i. This exchange is done in a pair-wise manner, in 
such a way that each processor in one of the subgroups sends its results to its 
corresponding partner in the other subgroup. Variable ll_partner indicates the 
processor in the other subgroup that holds the corresponding elements to be combined 
with. Variables NAME and LL_NAME contain, respectively, the logical processor 
name in the current group and the physical processor name. When this is done, the 
subgroups rejoin to form the original one. This methodology can be straightforwardly 
extended to non-binary divisions. 

The trivial case in this D&C algorithm is reached when only one processor remains in 
a group. This case is treated by the procedure seqFFT (line 20 in Fig. 2), which is 
simply the result obtained by serializing the code in Fig. 2. While the division phase has 
been reduced to a simple variable initialization (line 7), the combination phase can be 
done cooperatively by all the processors in the group. This is possible because partner 
processors can perform a symmetrical computation using the appropriate subset of 
elements from the replicated vector W. These elements are separated by stride positions 
and are pointed to by pW in the combination loop (line 12). 



 1 #define PAR(f1, r1, s1, f2, r2, s2) { \ 
 2   unsigned ll_partner; \ 
 3   MPI_Status status; \ 
 4   NUMPROCESSORS >>= 1; \ 
 5   ll_partner = LL_NAME ^ NUMPROCESSORS; \ 
 6   if ((NAME & NUMPROCESSORS) == 0) { \ 
 7     f1; \ 
 8     MPI_Sendrecv(r1, s1, MPI_BYTE, ll_partner, NUMPROCESSORS, \ 
 9                  r2, s2, MPI_BYTE, ll_partner, NUMPROCESSORS, \ 
10                  MPI_COMM_WORLD, &status); \ 
11   } \ 
12   else { \ 
13     NAME &= (NUMPROCESSORS-1); \ 
14     f2; \ 
15     MPI_Sendrecv(r2, s2, MPI_BYTE, ll_partner, NUMPROCESSORS, \ 
16                  r1, s1, MPI_BYTE, ll_partner, NUMPROCESSORS, \ 
17                  MPI_COMM_WORLD, &status); \ 
18     NAME |= NUMPROCESSORS; \ 
19   } \ 
20   NUMPROCESSORS <<= 1; \ 
21 } 



Fig. 3. Macro PAR implementation using MPI 
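
The bookkeeping performed by the macro can be followed without any communication. The sketch below is a hypothetical stand-alone model, not part of the authors' library: the struct `ProcState` and the functions `par_enter` and `par_leave` are names introduced here, and they only reproduce the bit manipulation on the group size, the logical name and the partner name. The message exchange itself is left out.

```c
#include <assert.h>

/* State of one simulated processor (field names follow Fig. 3). */
typedef struct {
    unsigned numprocessors; /* size of the current group (power of 2) */
    unsigned name;          /* logical name inside the current group  */
    unsigned ll_name;       /* physical name, never changes           */
} ProcState;

/* Result of one split: which half this processor joins and which
 * physical processor it exchanges results with afterwards. */
typedef struct {
    int      second_half;   /* 0: runs f1, 1: runs f2 */
    unsigned ll_partner;    /* physical name of the partner */
} Split;

/* Mirrors lines 4-6 and 13 of Fig. 3. */
Split par_enter(ProcState *p)
{
    Split s;
    assert(p->numprocessors > 1);
    p->numprocessors >>= 1;                          /* halve the group   */
    s.ll_partner  = p->ll_name ^ p->numprocessors;   /* flip one name bit */
    s.second_half = (p->name & p->numprocessors) != 0;
    if (s.second_half)
        p->name &= p->numprocessors - 1;             /* rename in subgroup */
    return s;
}

/* Mirrors lines 18 and 20 of Fig. 3: rejoin the parent group. */
void par_leave(ProcState *p, Split s)
{
    if (s.second_half)
        p->name |= p->numprocessors;                 /* restore the name  */
    p->numprocessors <<= 1;                          /* restore the size  */
}
```

For example, physical processor 5 in a group of 8 joins the second subgroup with partner 1 and logical name 1. A second split places it in the first subgroup of size 2 with partner 7, and two calls to `par_leave` restore name 5 and group size 8. No communicator is created at any point, which is why the split is cheap.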

Although the natural way to express nested parallelism in MPI is through the use of 
communicators and the function MPI_Comm_split, it carries a considerable overhead 
since its execution implies communications. Fig. 4 presents the results of comparing 
the time taken by MPI_Comm_split on a CRAY T3D with different numbers p of 
processors (curves labeled MPI-p). For each value of p, the experiment consisted of 
the repetition of N iterations (represented on the horizontal axis) of a loop partitioning 
the current communicator. The three curves labeled PAR-p show the time taken when 
the division (and reunification) is performed using the alternative division technique 
proposed above. They appear overlapped on the X-axis since the times they took are 
negligible compared with the time needed by the MPI_Comm_split version. 





100 



J.A. Gonzalez et al. 






[Plot: curves MPI-256, MPI-128, MPI-64, PAR-256, PAR-128, PAR-64; horizontal axis: number of calls (x1000)] 



Fig. 4. The cost of MPI_Comm_split vs. PAR 



4. Nested Parallelism in OpenMP 

OpenMP [5] defines a set of compiler directives, library routines and environment 
variables that extend Fortran and, separately, C and C++ [4] to express shared 
memory parallelism. Support for nested parallelism is included in the standard. If a 
thread in a team executing a parallel region reaches another parallel construct, it 
creates a new team and becomes the master of that new team. Nested parallel 
regions are serialized by default. As a result, a team composed of only one thread 
executes a nested parallel region. The default behavior may be changed using either 
the runtime library function omp_set_nested or the environment variable 
OMP_NESTED, in which case the number of threads in the team is implementation 
dependent. However, when a nested parallel region is reached, an implementation is 
always allowed to create a team composed of only one thread. Unfortunately this is 
the common case for most commercial OpenMP implementations. 
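
Whether a given runtime actually builds a multi-thread team for a nested region can be checked with a small probe. The sketch below is an illustration: the function name `inner_team_size` and the request for two threads per level are assumptions made here. It uses omp_set_nested because that is the routine named in the text, although later OpenMP versions deprecate it in favour of omp_set_max_active_levels. The code also compiles without OpenMP support, in which case the pragmas are ignored and everything runs serially.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the largest team size observed inside a nested parallel region.
 * enable_nested != 0 requests nested parallelism; the runtime may still
 * serialize the inner region and report a team of one thread. */
int inner_team_size(int enable_nested)
{
    int largest = 1;
#ifdef _OPENMP
    omp_set_nested(enable_nested);
#else
    (void)enable_nested;   /* built without OpenMP: serial execution */
#endif
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            int size = 1;
#ifdef _OPENMP
            size = omp_get_num_threads();   /* size of the inner team */
#endif
            #pragma omp critical
            {
                if (size > largest) largest = size;
            }
        }
    }
    assert(largest >= 1);
    return largest;
}
```

With nesting disabled the probe reports 1. With nesting enabled, a result of 1 indicates that the implementation serializes nested regions, which is the situation described above.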

Fig. 5 presents an implementation of the DFT algorithm using OpenMP. This 
algorithm works under the same assumptions as the algorithm presented in the previous 
section. In line 17, the directive parallel opens a parallel region, and the use of the 
clause firstprivate ensures that each thread in the team takes its own initialized copy 
of the variables n, B and C, while the rest of the variables are shared by default. Each 
recursive call to the routine ompFFT is enclosed as a section in the work-sharing 
construct sections (lines 19-25). As a result, each of them will be executed once by a 
thread in the team. In the first call, all available threads will compose the team in the 
parallel region, but in the subsequent nested calls the number of threads in a team is 
implementation dependent and, following the standard specifications, a compiler is 
allowed to assign only the current thread to the corresponding team. There is an 
implicit barrier at the end of the sections construct that is needed before performing 
the combination phase (lines 28-39). The combination is also executed in parallel by 
the work-sharing construct for, which partitions the set of iterations. The 
clause schedule(static) forces a block distribution of equal chunk size among all 
threads in the team. After that, the parallel region is closed and the execution 
continues sequentially. 

void ompFFT(Complex *A, Complex *a, Complex *W, unsigned N,
            unsigned stride, Complex *D) {
  Complex *B, *C;
  Complex Aux, *pW;
  unsigned n;
  int i;
  if (N == 1) {
    A[0].re = a[0].re;
    A[0].im = a[0].im;
  }
  else {
    /* Division phase */
    n = (N >> 1);
    B = D;
    C = D + n;
    /* Subproblems resolution phase */
    #pragma omp parallel firstprivate(n, B, C)
    {
      #pragma omp sections
      {
        #pragma omp section
        ompFFT(B, a, W, n, stride<<1, A);
        #pragma omp section
        ompFFT(C, a+stride, W, n, stride<<1, A+n);
      }
      /* Combination phase */
      pW = W;
      #pragma omp for private(Aux) \
                      firstprivate(pW) \
                      schedule(static)
      for (i = 0; i < n; i++) {
        Aux.re = pW->re * C[i].re - pW->im * C[i].im;
        Aux.im = pW->re * C[i].im + pW->im * C[i].re;
        A[i].re = B[i].re + Aux.re;
        A[i].im = B[i].im + Aux.im;
        A[i+n].re = B[i].re - Aux.re;
        A[i+n].im = B[i].im - Aux.im;
        pW += stride;
      }
    }
  }
}

Fig. 5. FFT implementation using OpenMP 

Recent work has pointed out the need for more flexible control structures that 
allow OpenMP to handle common programming idioms such as recursive control and list 
or tree data structures. Extensions to the standard have been proposed, such as the 
Workqueuing model [6] and another for the creation of groups [1]. The Workqueuing model 
introduces two new directives to OpenMP: taskq and task. The first causes an empty 
queue to be created, and a single thread executes the code inside the taskq block. The 
second specifies a unit of work to be enqueued in the queue created by the 
enclosing taskq block; it can be dequeued and executed by any thread. Taskq 
directives may be nested inside each other, generating a logical tree of queues. On the 
other hand, in [1] two new clauses are proposed: groups and onto. The groups clause can be 
applied to the work-sharing constructs for and sections, and allows the creation of sets of 
a specified number of threads. The onto clause assigns each 
section to a previously created group. In this case, nesting of work-sharing constructs 
could be handled by creating subgroups. Both extensions have running 
implementations available (http://www.kai.com/parallel/kappro/), 
(http://www.cepba.upc.es/nanos.html). 

Alternatively, we can still implement the FFT algorithm using the schedule thread 
methodology presented in the MPI section. 

5. Comparative: Expression Capacity and Efficiency 

The experiments were carried out on the Sun HPC 3500, an UltraSPARC II based system 
at the Edinburgh Parallel Computing Centre (8 processors and 8 Gbytes of shared 
memory), and on the Silicon Graphics Origin 2000 (64 MIPS R10000 processors and 8 
Gbytes of main memory) at the European Center for Parallelism of Barcelona. The 
OpenMP implementation used on the Sun was the one delivered by Kuck and Associates, 
Inc. (KAI). The Origin compiler was MIPSpro version 7.30. Both MPI libraries 
were the native implementations. Experiments were carried out with different vector 
sizes. Table 1 and Table 2 present the times measured with a vector size 
of 1 million elements on the Sun and on the Origin, respectively. 



PROCS    OMP      OMP_LL   MPI
1        -        4.756    4.719
2        -        2.615    2.733
4        -        1.423    1.593
8        -        0.883    0.919

Table 1. FFT. 1 million elements. Sun HPC 3500. 



The column labeled MPI corresponds to the code presented in Fig. 2, while the columns 
labeled OMP and OMP_LL correspond, respectively, to the code presented in Fig. 5 and to 
the improved OpenMP implementation using the methodology described in section 3. 



PROCS    OMP      OMP_LL   MPI
1        10.772   9.763    9.091
2        9.713    7.214    7.201
4        9.028    5.806    3.604
8        7.173    4.378    2.211

Table 2. FFT. 1 million elements. SGI Origin 2000. 

Although the Silicon Graphics compiler detects the existence of nested parallelism, it is 
unable to exploit it. The small improvement observed as the number of processors increases is 
due to the data parallelism exploited during the combination phase. Even worse, the 
KAI compiler does not generate correct code, and the corresponding column in 
Table 1 is empty. 

It can be observed that the MPI and OMP_LL times on the Sun multiprocessor 
are comparable. Once the macro PAR is encapsulated in a header file, both source 
codes are almost the same. However, OMP_LL scales worse than 
MPI on the SGI Origin. The explicit locality of the MPI version seems to have an 
impact on a distributed shared memory architecture. 



6. Conclusions 

Both the Sun HPC 3500 and the SGI Origin 2000 are shared memory architectures 
and hence more appropriate for exploitation under OpenMP. Although OpenMP 
allows the explicit expression of nested parallelism, the current implementations are 
unable to take advantage of it. On the other hand, the use of directives to implement 
work-sharing constructs makes OpenMP better suited than MPI to exploit data 
parallelism. The use of the methodology presented in section 3 for MPI allows the 
same high level of expression as that exemplified in Fig. 5 without paying any 
performance penalty. 

Abstract. The portability of parallel programs has required a lot of effort during 
the last decade. PVM and MPI have greatly contributed to solving this problem, 
and nowadays most parallel programs are portable. However, the portability of 
efficiency suffers, in many cases, from inherent effects of the target architectures. 
The optimal mapping of a parallel program is strongly dependent on 
the granularity and the network architecture. We address the problem of finding the 
optimal mapping of pipeline MPI programs. We propose an analytical model 
that allows an easy estimation of the parameters needed to obtain the mapping. 
The model can be incorporated into tools that produce this mapping automatically. 
Both the accuracy of the model and the optimal efficiency of the algorithm 
found are checked on a pipeline algorithm for the Path Planning 
Problem. 



1 Introduction 

Many pipeline algorithms show optimal behavior when they are considered only 
from the theoretical point of view, in which as many processors as the number of inputs 
are available. However, most of them run poorly when they are executed on 
current architectures. The implementation of pipeline algorithms on a target architecture 
is strongly conditioned by the actual assignment of virtual processes to the physical 
processors and their simulation, the granularity of the architecture, and the instance 
of the problem to be executed. To preserve the optimality of the algorithm, a proper 
combination of these factors must be considered. 

Several software approaches to solve this problem have been provided by different 
authors. Although HPF is a data parallel language, the approved extensions of version 2.0 [5] 
introduce constructs to express pipeline parallelism. However, the only existing 
HPF implementation conforming to the extensions [3] does not deal with the 
optimal mapping of these algorithms. The same occurs with P3L, a skeleton oriented 
language allowing the expression of pipelining and its combination with other paradigms 
such as farming, prefixes, etc. [4]. This absence of software contrasts with the 
amount of theoretical work. Most of it solves the problem under particular assumptions. 
Good solutions are known for the case when the computation occurring between 
successive communications is constant [1], [2], [9]. The general approach followed in 
those works consists of finding a cost model. This model leads to an optimization 
problem whose solution for some particular cases can be expressed analytically. Unfortunately, 
the inclusion of these methodologies in a software tool is far from being 
a practicable task. 

The llp tool presented in [8] allows cyclic and block-cyclic mapping of pipeline algorithms 
according to the user specifications. llp is conceived as a macro based library, 
and its portability is guaranteed since it is built on top of standard message passing 
libraries (MPI, PVM). We have provided it with a buffering functionality, and it is also 
an objective of this paper to supply a mechanism that allows llp to generate the optimal 
mapping automatically. 

In section 2 we discuss some issues associated with the problem of mapping a virtual 
pipeline onto a physical ring of processors. We introduce the need to find an 
optimal grain of processes, an optimal buffering of data and an efficient virtualization. 
In section 3 the Path Planning Problem is formulated and a pipeline dynamic programming 
algorithm is described. In section 4 we propose an analytical model that 
allows an estimation of the parameters involved in the optimal mapping. The accuracy 
of the model is checked against the Path Planning Problem in section 5. According to 
the computational experience, the numerical approach followed for the estimation of 
the parameters shows an acceptable prediction error. As we conclude in section 
6, this numerical approximation is suitable for inclusion in a tool that automatically 
generates the optimal mapping. 

The computational experiments were carried out on a Cray T3E, a 3-dimensional 
torus with distributed memory on a shared address space. It has 16 DEC 21164 
processors. Each processor has 128 Mb of memory and reaches 600 Mflops. 



2 The Problem 

The mapping problem is defined as finding the optimal assignment of computations 
to processors to minimize the execution time. We consider that the code executed by 
every virtual process of the pipeline is the standard loop of figure 1. In the loop 
that we consider, body0 takes constant time while body1 and body2 depend on the 
iteration of the loop. The code of figure 1 represents a wide range of situations, as 
is the case of many parallel Dynamic Programming algorithms [6], [10]. 

void f() {
  Compute(body0);
  while (running) {
    Receive();
    Compute(body1);
    Send();
    Compute(body2);
  }
}

Fig. 1. Standard loop on a pipeline algorithm. 

The virtual processes running this code must be assigned among the available 
processors. This is the problem of finding an efficient mapping of the virtual pipeline 
on the actual parallel machine. The classical technique consists of partitioning the 
set of processes following a mixed block-cyclic mapping depending on the grain G of 
processes assigned to each processor. Implementation is achieved on a one-way ring 
topology where the first and last processors are connected through a buffering process. 

The straightforward technique to obtain a block-cyclic mapping, which consists of 
performing G sequential executions of the function f in each processor, is not a good 
approach. The delay introduced by each processor produces a parallel algorithm as 
slow as the sequential one. To obtain an efficient implementation, processors must 
start to work as soon as possible and they must be fed with data when needed. 
Efficient implementations are obtained when the context switch between processes is 
performed every time a process communicates a value. 

We can now formulate a first question: Which is the optimal value for G? 

Another important factor is how data are communicated between processors. It 
must be taken into consideration that we are dealing with communication-intensive 
applications. Depending on the granularity of the architecture (the ratio of the time 
required for a basic communication to the time required for a basic computation) and 
the grain size G of the computation, it is convenient to buffer the outgoing data 
in the sender processor instead of sending each output as soon as it is produced. When 
the outputs fill the buffer, the data are sent as a single packet. Buffering data reduces 
the communication overhead but can introduce delays between processors, increasing 
the startup time of the pipeline. The size B of the buffer is an important parameter to 
be considered when mapping a pipeline algorithm. 

We introduce the second question: Which is the optimal value for B? 

An experimental analysis of the problem requires a wide range of executions varying 
the grain G, the size of the buffer B and the number of processors p. The number 
of parameters involved calls for tools that reduce the effort invested 
by the programmer in the development. 

La Laguna Pipeline [8], llp, is a general purpose tool for the pipeline programming 
paradigm. The user specifies the code for the first processor, for the last processor and 
for a general processor (code of figure 1) of a virtually infinite pipeline. llp 
enrolls the virtual pipeline into a simulation loop according to the mapping policy 
specified. llp supports cyclic, block and block-cyclic policies. The block-cyclic 
mapping is implemented using the Unix standard library "setjmp.h". This library allows 
unconditional jumps to variable labels. The tool differentiates between internal 
and external communications of the virtual processors. On every internal communication 
a context switch is performed using the functions setjmp() and longjmp() of 
that library. Since the context switch is implemented very efficiently, the overhead 
introduced by the tool on a block-cyclic mapping is minimal. 

llp also provides a directive to pack the data produced on the external communications 
into a single buffer-packet. The user can specify the number of elements B to 
be buffered and fit the size of the packet to the communication parameters 
(latency and bandwidth) of the network on the target architecture. In [8] we show that 
the performance obtained with llp is similar to that obtained by an experienced 
programmer using standard message passing libraries. 

3 The Path Planning Problem 

As an example to illustrate the importance of a good choice of the parameters G and 
B, we consider the Path Planning Problem (PPP). A map is an m × n grid 
of positions for some positive integers m, n. The eight neighbours of a position p are 
indicated by the corresponding cardinal point of the compass (fig. 2-a). 

[Figure: a) eight neighbours of a position P; b) red sweep; c) blue sweep] 

Fig. 2. Neighborhood dependencies on the PPP. 

Each position p is associated with a non-negative real number tc(p) corresponding 
to the traversability cost of the position. Given a position p and a neighbour q of p, 
the edge (p, q) is weighted with a cost $c(p, q) = (tc(p) + tc(q))/2$ if $q \in \{N, S, W, E\}$ and 
$c(p, q) = (tc(p) + tc(q)) \cdot \sqrt{2}/2$ otherwise: the $\sqrt{2}$ multiplier reflects the added 
travelling distance due to the diagonal connection. Given a position, called the source, we 
want to compute the shortest path (or minimum-cost path) from it to every position in 
the map. A dynamic programming algorithm to solve the problem has been proposed 
in [7]. The algorithm performs a succession of red and blue sweeps of the map. On the 
red sweep, a forward scan of the map M in row-major order is performed. 
Each position p is updated according to the red mask depicted in fig. 2-b. On the blue 
sweep, a reverse scan of the map in row-major order is performed. Each position 
p is updated according to the blue mask depicted in fig. 2-c. The best-known cost 
f(p) in the red and blue sweeps is updated according to formulas (I) and (II), respectively. 
The red and blue sweeps are performed alternately until no values are 
changed in one sweep. A general stage of the parallel algorithm appears in figure 3. 

$f(P) = \min\{f(P),\ f(W) + c(W, P),\ f(N) + c(N, P),\ f(NW) + c(NW, P),\ f(NE) + c(NE, P)\}$   (I) 

$f(P) = \min\{f(P),\ f(E) + c(E, P),\ f(S) + c(S, P),\ f(SW) + c(SW, P),\ f(SE) + c(SE, P)\}$   (II) 

Fig. 4 presents the running times obtained for G ranging from 1 to 32 and B varying 
from 1 to 512, for each number of processors. Values of G and B outside these ranges 
produce worse running times. The test problem is a 1024 × 1024 map. When the number 
of processors increases, the product G times B must be reduced to decrease the 
contention of initialisation while keeping a low latency in communications. The minimum 
lies in a valley that moves depending on the number of processors. 

4 The Analytical Model 

Given a parallel machine, we aim to find an analytical model to obtain the optimal 
values of G and B for an instance of a problem. This problem has been previously 
formulated in [2] using tiling. The size of the tiles must be determined assuming the 
shape. However, the approach taken assumes that the computational bodies 0 and 2 in 
the loop are empty and that body 1 takes constant time on each iteration. Also, 
considerations about the simulation of the virtual processes are omitted. 

Obviously, the analytical model involves both the parallel algorithm and the parallel 
architecture. The time that elapses from the moment that a parallel computation 
starts to the moment that the last processor finishes execution has to be modeled. 
When modeling interprocessor communications, it is necessary to differentiate between 
external communications (involving physical processors) and internal communications 
(involving virtual processors). 

void solve_PPP() {
  int j, x, f[MAX_n];
  for (j = 0; j < n; j++)
    switch (j) {
      case 0:
        IN(&x);
        f[0] = cost(N, f[0], x);
        f[1] = cost(NW, f[1], x);
        break;
      case n-2:
        IN(&x);
        f[n - 2] = cost(NE, f[n - 2], x);
        OUT(&f[n - 2], 1, sizeof(int));
        break;
      case n-1:
        f[n - 1] = cost(N, f[n - 1], x);
        f[n - 1] = cost(W, f[n - 1], f[n - 2]);
        OUT(&f[n - 1], 1, sizeof(int));
        break;
      default:
        IN(&x);
        f[j - 1] = cost(NE, f[j - 1], x);
        f[j] = cost(N, f[j], x);
        f[j] = cost(W, f[j], f[j - 1]);
        f[j + 1] = cost(NW, f[j + 1], x);
        OUT(&f[j - 1], 1, sizeof(int));
        break;
    }
} /* solve_PPP */

Fig. 3. Pipeline algorithm for the PPP. llp code of a general stage. 

[Figure: running-time surfaces over G and B; panels for np 2 and np 4] 

Fig. 4. Running times for the Path Planning Problem. Different values for G and B. np 
denotes the number of processors. 



For the external communications, we use the standard communication model. At 
the machine level, the time to transfer B words between two processors is given by 
$\beta + \tau B$, where $\beta$ is the message startup time (including the operating system call, 
allocating message buffers and setting up DMA channels) and $\tau$ represents the per-word 
transfer time. 

For the internal communications we assume that the per-word transfer time is zero 
and we have to deal only with the time to access the data. We differentiate between an 
external reception ($\beta^E$) without context switch between processes and an internal 
communication ($\beta^I$) where the context switch must be considered. 

We denote by $t_0$, $t_{1i}$ and $t_{2i}$ the time to compute, respectively, body0, body1 and 
body2 at iteration i. 

$T_s$ denotes the startup time between two processors. $T_s$ includes the time needed 
to produce and communicate a packet of size B. To generate it, the first B outputs of 
the G virtual processes must be produced. $T_c$ denotes the whole evaluation of G 
processes, including the time to send m/B packets of size B. 

$T_s = t_0 (G-1) + \sum_{i=1}^{B-1} t_{1i}\, G B + G \sum_{i=1}^{B-1} t_{2i} + 2 \beta^I (G-1) B + \beta^E B + \beta + \tau B$ 

$T_c = t_0 (G-1) + \sum_{i=1}^{B-1} t_{1i}\, G m + G \sum_{i=1}^{m} t_{2i} + 2 \beta^I (G-1) m + \beta^E m + (\beta + \tau B)\, m / B$ 

The first three terms accumulate the computation time, the fourth term is the 
time of the context switches between processes, and the last terms include the time to 
communicate packets of size B. 

Depending on the parameters G and B, two situations may appear when executing a 
pipeline algorithm. After a processor finishes the work in one band, it goes on to compute 
the next band. At this point, data from the previous processor may or may not be available. If 
data are not available, the processor spends idle time waiting for data. This situation 
arises when the startup time of processor p (the first processor of the ring in the second 
band) is larger than the time to evaluate G virtual processors, i.e., when $T_s \cdot p > T_c$. 
We denote by $R_1$ the values (G, B) where $T_s \cdot p \le T_c$ and by $R_2$ the values (G, B) 
such that $T_s \cdot p > T_c$. 

For a problem with n stages on the pipeline (n virtual processors) and a loop of size m 
(m iterations of the loop), the execution time for $1 \le G \le n/p$ and $1 \le B \le m$ is the 
following: 

$T(G, B) = \begin{cases} T_1(G, B) = T_s (p-1) + T_c\, n/(G p) & \text{in } R_1 \\ T_2(G, B) = T_s (n/G - 1) + T_c & \text{in } R_2 \end{cases}$ 

$T_s \cdot p$ holds the time to start up processor p, and $T_c \cdot n/(G p)$ is the time invested in 
computations after the startup. Note that in $R_1$ the processors are fed with data and 
there is no idle time between bands. 

$T_s \cdot (n/G - 1)$ is the startup time of the last processor working on the last band. This 
time includes the idle time that the processors spend between bands. 

In the model T(G, B), once the number of processors p is fixed, the parameters $\beta^I$, $\beta^E$, 
$\beta$ and $\tau$ are architecture-dependent constants, and $t_0$, $t_{1i}$, $t_{2i}$, m and n are variables 
depending on the instance of the problem. The actual values of these variables are known at 
running time. An analytical expression for the values (G, B) leading to the minimum 
will depend on the five variables and seems to be a very complicated problem to 
solve. Instead of an analytical approach, we approximate the values of (G, B) 
numerically. 

Note that $T_1(G, B) < T_2(G, B)$ in $R_2$, $T_2(G, B) < T_1(G, B)$ in $R_1$, and $T_1(G, B) = T_2(G, B)$ 
at the boundary of $R_1$ and $R_2$ (i.e., when $T_s \cdot p = T_c$). This fact allows us to consider 
$T(G, B) = \max\{T_1(G, B), T_2(G, B)\}$, $1 \le G \le n/p$ and $1 \le B \le m$. 

An important observation is that T(G, B) first decreases and then increases if we 
keep G or B fixed and move along the other parameter. Since, for practical purposes, 
all we need is to give values for (G, B) leading us to the valley of the surface, a few 
numerical evaluations of the function T(G, B) will supply these values. 



5 Experimental Results 

To check the accuracy of the model, we have applied it to estimate the optimal grain 
G and optimal buffer B for the path planning problem considered in section 3. In the 
pipeline algorithm for this problem (figure 3), bodies 0 and 2 are empty while body 
1 depends on the iteration. 

The numerical results presented in this section correspond to the running times 
depicted in figure 4. Table 1 presents the values for grain and buffer (G-Model, B-Model) 
obtained with the model, the running time of the parallel algorithm for these 
parameters (Real Time) and the values of grain and buffer (G-Real, B-Real) giving the 
best running time (Best Real Time). The table also shows the error made ((Real Time - 
Best Real Time) / Best Real Time) when we consider the parameters provided by the 
tool instead of the optimal values. The model shows an acceptable prediction, 
with an error not greater than 19 %. 



Table 1. Estimation of G, B for the PPP. 

p    G-Model  B-Model  Real Time  G-Real  B-Real  Best Real Time  Error
2    2        384      7.522      8       384     6.719           0.119
4    2        192      4.080      8       128     3.440           0.186
8    2        96       2.113      8       96      1.788           0.182
16   2        48       1.092      8       32      0.968           0.127



Table 2 presents the times and speedups obtained for the cyclic mapping, for the block 
mapping and for the mapping proposed by the model. Columns are labeled T-* and S-*, 
respectively. Figure 5 illustrates them graphically. 

Observe the considerable improvement of the T-Model column over the naive 
cyclic mapping, almost reaching a factor of 10. Although the comparison with the 
T-Block column may not seem so impressive, note that the 
T-Block column does not correspond to a pure block mapping but to a block-cyclic 
assignment with grain 32. The gain over a pure block mapping would be larger. 



Table 2. Running time and Speedup for several mappings. 

P    T-Cyclic  S-Cyclic  T-Block  S-Block  T-Model  S-Model
2    78.884    0.061     10.527   0.460    7.522    0.643
4    41.940    0.115     5.323    0.909    4.080    1.186
8    21.020    0.230     2.702    1.791    2.113    2.291
16   10.643    0.455     1.581    3.061    1.092    4.434

6 Conclusions 

We have developed an analytical model that predicts the effects of the grain of 
processes and the buffering of messages when mapping pipeline algorithms. The model allows 
an easy estimation of the parameters through a simple numerical approximation. The 
model can be incorporated into tools (like llp) that produce the optimal values 
for the grain and buffer automatically. During the execution of the first band, the tool 
estimates the parameters defining the function T(G, B) and carries out the evaluation 
of the optimal values of G and B. The overhead introduced is negligible, since only a 
few evaluations of the objective function are required. After this first test band, the 
execution of the parallel algorithm continues with the following bands, making use of 
the optimal grain and buffer parameters. 

Daniel Gonzalez et al. 




[Figure: running time (series T-Best, T-Model, T-Block) and speedup (series S-Best, S-Model, S-Block)] 

Fig. 5. Running Time and Speedup for the different mappings. 

Abstract. Markov Random Field (MRF) based algorithms used for 
restoring images affected by blur and noise are generally very effective, 
but most of them are computationally very heavy. The recently proposed 
ARTUR algorithm is a deterministic method that belongs to this family. 
In this work, a parallel implementation of this algorithm is proposed on 
a PVM environment. Various results are shown and the performance is 
analysed comparing the proposed parallel implementation against the 
sequential one. 



1 Introduction 

The image restoration process we consider in this work consists of the 
removal of the additive Gaussian noise that degrades an image. The model for 
the original non-degraded image consists of uniform regions, separated by abrupt 
changes in gray level value called edges. P. Charbonnier et al. [1] proposed the 
Markov Random Field (MRF) based, deterministic algorithm ARTUR, which performs 
a restoration by finding the Maximum a Posteriori (MAP) estimate of 
the original image, given the degraded one. The contribution of this work is the 
parallelization of the ARTUR algorithm and the analysis of its performance. 
ARTUR was chosen because it follows a deterministic approach that allows 
the use of non-convex potential functions. These functions are known to preserve 
edges better than the convex ones. 

PVM was selected as the framework because it is widely used in network 
environments for programs based on the message passing paradigm and because 
it is portable to a great number of computer architectures. A master-slave model 
was used to implement the aforementioned algorithm. 

This article describes briefly the theoretical basis of the image processing 
involved and the parallel implementation of the algorithm, then it shows the 
results obtained with different configurations and the derived conclusions. The 
results show that the parallelization reduces the execution time compared to a 
sequential implementation of the same algorithm. 



J. Dongarra et al. (Eds.): EuroPVM/MPI2000, LNCS 1908, pp. 113-^23 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 

2 MAP Restoration by Energy Minimization 

The general degradation model consists of a blur operator applied to the original 
image and the subsequent addition of white Gaussian noise. This is the direct 
transformation. In order to obtain the original non-degraded image from the 
degraded one, an inverse problem must be solved. The Maximum a Posteriori 
criterion estimates the non-degraded original image $f$ as the one that maximizes 
the a posteriori probability, and is given by 

$\hat{f}_{MAP} = \arg\max_f \Pr(p \mid f)\, \Pr(f)$ ,   (1) 

where $p$ is the degraded image and $\Pr(p \mid f)$ is the likelihood of the degraded 
image given the original one. $\Pr(f)$ is the 'a priori' probability and represents 
the hypothesis made over the solution in order to regularize the problem. 

If we consider that $f$ is a Markov Random Field [2], this problem can be formulated 
in terms of the minimization of an energy function $E(f)$, given by 

$E(f) = \sum_s \left(p_s - (Kf)_s\right)^2 + \lambda^2 \sum_{\langle r,t \rangle} \varphi\!\left(\frac{f_r - f_t}{\delta}\right)$ ,   (2) 

where $K$ is the blur operator and $\varphi$ is the so-called potential function [?]. The first 
summation is carried out over all the positions $s$ in the image and represents the 
faithfulness to the degraded data $p$. The second summation is done over all the 
pairs of adjacent positions $\langle r,t \rangle$, called cliques, and represents the 'a priori' 
probability distribution over all the original non-degraded images. The potential 
function is even, positive and belongs to a family of functions that fulfils a 
set of conditions that assures edge preservation [?]. The constant $\lambda$ balances 
the relative weight of both summations and $\delta$ has influence on the discontinuity 
detection threshold. 

It is worth mentioning that, depending on the particular potential function $\varphi$ 
chosen, the associated energy may be non-convex in $f$. 

The MAP criterion (1) can then be reformulated as 

$\hat{f}_{MAP} = \arg\min_f E(f)$ .   (3) 

Several approaches have been proposed to solve this problem [?], [?], [?]. 
We will consider in this work the deterministic algorithm ARTUR proposed by 
P. Charbonnier et al. [1], which uses a dual energy $E^*(f, b)$ that depends on the 
original image $f$ and on the set of auxiliary images $b = \{b_h, b_v, b_{ld}, b_{rd}\}$ that 
represent the edges in the horizontal, vertical, left diagonal and right diagonal 
directions respectively. This dual energy is defined by 

$E^*(f, b) = \sum_s \left(p_s - (Kf)_s\right)^2 + \lambda^2 \sum_{\langle r,t \rangle} \left[ b_{rt} \left(\frac{f_r - f_t}{\delta}\right)^2 + \psi(b_{rt}) \right]$ ,   (4) 


where the function $\psi$ is defined such that the condition $\min_b E^*(f, b) = E(f)$ holds 
(see [3] and [?]). The dual energy $E^*$ has the properties of being quadratic, and thus 
convex, in $f$ for $b$ fixed and of being convex in $b$ when $f$ is fixed. The sought 
estimator (3) can then be calculated as 

$\hat{f}_{MAP} = \arg\min_f \min_b E^*(f, b)$ .   (5) 



3 The ARTUR Algorithm 

This algorithm is based on the minimization of the aforementioned dual energy 
and follows a deterministic scheme given by 

Algorithm 1 ARTUR 
Begin 
  $f^0 = 0$ 
  While ( $f^n$ doesn't converge ) 
    $b^{n+1} = \arg\min_b E^*(f^n, b)$ 
    $f^{n+1} = \arg\min_f E^*(f, b^{n+1})$ 
  End While 
End 

It begins with a null estimate of $f$ and proceeds with an alternate minimization 
of the dual energy $E^*$ with respect to the auxiliary images $b$ and with 
respect to the unknown original image $f$, generating a sequence of estimates 
that converges to $\hat{f}_{MAP}$. 

The first minimization in the loop is straightforward and is given by 
$b^{n+1} = \varphi'(u) / (2u)$, where $u$ denotes the argument of the potential function 
for each clique. The second minimization is done using an iterative method. 
To do this, we have chosen the Successive Over-Relaxation (SOR) algorithm. 

4 Parallel Implementation 

The parallel version of the ARTUR algorithm was implemented within the PVM
environment using a master-slave model. This model allows a master
process to take care of all input/output operations and of the distribution and
gathering of data. The slaves carry out the calculations with the data received
from the master.

The master process reads the whole image, divides it into horizontal strips, and
sends each strip to a slave process. Then it waits for the slaves to send back the
results, joins the data and writes them back to disk. Meanwhile, the slaves
receive the strips and perform the calculations of the minimization of the energy
function. The master process is shown in the block diagram in Fig. 1.





[Flowchart: read image -> send subimages to slaves -> receive subimages from slaves -> save restored image]





Fig. 1. Master algorithm on parallel adaptation of ARTUR. Master - Slave 
Model. 



Figure 2 shows two instantiations of the slave program. In this figure, points
A, B and C indicate the places where data exchange is needed. For each slave,
the outer loop corresponds to the main loop in the sequential algorithm and the
inner loop corresponds to the SOR algorithm.

Each slave works on a different horizontal strip of the image and exchanges 
information with the two other processes it shares data with. This is true for all 
processes, except for those that calculate the minimization for the first and the 
last strips, which share data with only one process. 

The division of the image into strips was done in order to reduce the number of 
processes each process has to communicate with. 
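The strip decomposition described above can be sketched as follows. This is only an illustration under assumed conventions: the function name `split_into_strips` is hypothetical, rows are numbered from 0, and the remainder rows are given to the first strips when the image height is not a multiple of the number of slaves. The neighbour lists show why interior strips exchange data with two processes and the first and last strips with only one.

```python
def split_into_strips(n_rows, n_slaves):
    """Return (first_row, end_row_exclusive, neighbour_ids) for each slave."""
    base, extra = divmod(n_rows, n_slaves)
    strips, start = [], 0
    for s in range(n_slaves):
        rows = base + (1 if s < extra else 0)   # spread leftover rows over the first strips
        neighbours = [t for t in (s - 1, s + 1) if 0 <= t < n_slaves]
        strips.append((start, start + rows, neighbours))
        start += rows
    return strips
```

With a decomposition into square blocks instead of strips, an interior process would have up to eight neighbours, which is the communication pattern the strip layout avoids.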

The implementation of the slaves was actually done by fixing the number
of iterations in both loops. Enough iterations were allowed in order to let the
convergence factor, given by $\|f^{n+1} - f^n\| / \|f^n\|$, attain a value equal to or less

than 10"®. 

5 Results 

In Fig. 3, image a) shows the synthetic original non-degraded image and image
b) shows the noisy image obtained by the addition of white Gaussian noise
with parameters $\mu = 0.0$ and $\sigma = 20.0$. Images c) and d) show the sequential
restoration and the parallel restoration respectively. These restorations have been
made with $\lambda = 25.0$ and $\delta = 3.0$, and the potential function used was
$\varphi(u) = 2\sqrt{1 + u^2} - 2$.







Fig. 2. Slave algorithms on parallel adaptation of ARTUR for two partitions.
Master - Slave Model.

Fig. 3. a) Original image, b) Noisy image, c) Sequential restoration, d) Parallel
restoration.








Fig. 4. a) Noisy image, b) Restored image. 



The evaluation of the difference between the sequentially restored image $f^s$
and the parallel restored image $f^p$ was made using the relative error between
both images, defined as $\varepsilon_{f^s f^p} = \|f^s - f^p\| / \|f^s\|$. In order to assess the quality
of the restoration, the relative difference between the original image $f^o$ and the
parallel restoration $f^p$, given by $\varepsilon_{f^o f^p} = \|f^o - f^p\| / \|f^o\|$, was used. The values
obtained were $\varepsilon_{f^s f^p} = 0.0038$ and $\varepsilon_{f^o f^p} = 0.0392$ respectively.
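A relative error of this kind can be computed as in the following sketch. The choice of the Euclidean norm, the normalization by the first (reference) image, and the name `relative_error` are assumptions made for illustration; images are taken as flat lists of pixel values.

```python
import math

def relative_error(reference, other):
    """Norm of the difference divided by the norm of the reference image."""
    diff = math.sqrt(sum((r - o) ** 2 for r, o in zip(reference, other)))
    norm = math.sqrt(sum(r * r for r in reference))
    return diff / norm
```

A value of 0 means the two images are identical; a value of 1 is obtained when the second image is entirely zero.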

As an illustrative example, Fig. 4 shows a restoration of a real LANDSAT
image of a rural area with several crop fields. In this case, the parameters used
were $\lambda = 20.0$ and $\delta = 1.0$ and the potential function used was the same as the
one used in the case of the synthetic image.

In Table 1 we present some execution times. These results were obtained
using the following configurations:

Size (Pixels)   SUN-S   LINUX-S   SUN-P   LINUX-P
128x128           105        82      69        17
256x256           389       376     290        61
512x512          1445      1456    1114       204



Table 1. Times obtained with sequential and parallel implementations on different
machines for various image sizes.



- a uniprocessor SUN Ultra Creator 1 computer, running Solaris 2.7 and
PVM 3.4 (named SUN-S and SUN-P for the sequential and parallel models
respectively), with a 167 MHz processor and 128 Mb of RAM,

- an Ethernet network with 10 Mb/s UTP connecting eight PCs running
LINUX Red Hat 6.0 and PVM 3.4 (named LINUX-S for the sequential and
LINUX-P for the parallel model). Each PC has a 233 MHz Pentium processor
and 32 Mb of RAM.

All the execution times are measured in seconds and the image sizes are ex- 
pressed in pixels. The times for the SUN-P and LINUX-P columns were obtained 
using eight processes. 
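The speedups implied by Table 1 follow from dividing each sequential time by the matching parallel time. The sketch below simply copies the table; the function names `speedup` and `efficiency` are illustrative, and the efficiency figure assumes one process per processor, which only applies to the LINUX cluster.

```python
TIMES = {  # execution times in seconds, copied from Table 1
    "128x128": {"SUN-S": 105, "LINUX-S": 82, "SUN-P": 69, "LINUX-P": 17},
    "256x256": {"SUN-S": 389, "LINUX-S": 376, "SUN-P": 290, "LINUX-P": 61},
    "512x512": {"SUN-S": 1445, "LINUX-S": 1456, "SUN-P": 1114, "LINUX-P": 204},
}

def speedup(size, platform):
    """Sequential time divided by parallel time on the same platform."""
    row = TIMES[size]
    return row[platform + "-S"] / row[platform + "-P"]

def efficiency(size, platform, n_processes=8):
    """Speedup per process; meaningful when each process has its own processor."""
    return speedup(size, platform) / n_processes
```

On the LINUX cluster this gives speedups of about 4.82, 6.16 and 7.14 for the three image sizes, so the efficiency grows with the image size; on the uniprocessor SUN the gain is between roughly 1.3 and 1.5.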

In Fig. 5 we present the speedup for two, four and eight processors on the
above-mentioned LINUX machines.




[Plot legend: ideal case; 128 x 128; 256 x 256; 512 x 512]



Fig. 5. Speedup for 2, 4 and 8 PCs using PVM under LINUX. 



The analysis of the execution times of Table 1 shows that it can be better to
run a parallel implementation on a uniprocessor machine than to run a sequential
implementation on the same machine (see the columns corresponding to SUN-S
and SUN-P). The reason for this is that, with a parallel implementation, the
percentage of time the processor devotes to processing the image is greater than in
the sequential case.

6 Conclusions 

From the execution times on the various configurations considered,
it is clear that parallelization can reduce the execution time both on the
uniprocessor architecture and on the cluster of computers. The execution
times obtained with the cluster of computers are encouraging because they show
that, at least for this type of processing, this architecture is a good choice. The
parallelization process consisted of including data exchange points in the
slave processes and dividing the image in the master process. Also, the
model chosen and the use of PVM allow high portability, shown by the use of
the same program on different platforms.

Abstract. The Least-Squares Finite Element Method (LSFEM) is applied
to solve the neutron transport equation. Standard parallel algorithms,
such as domain partitioning or classical iterative solvers, are developed
and tested for 1-D benchmarks on different architectures, the
final goal being to select the most efficient approach suitable for realistic
3D problems.



1 Introduction 

The neutron transport equation must be solved to know the neutron flux at a 
specific time and location, moving at a certain speed along a specific direction. 
Parallel schemes can help with the solution of this equation, which is far too 
complex for realistic 3D reactor core models. To present such schemes, here a 
simplified version of the problem is considered (1-D plane geometry, one-speed 
and steady state): 



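$$\mu \frac{\partial \psi}{\partial x}(x, \mu) + \sigma(x)\,\psi(x, \mu) = \sigma_s(x)\,P\psi(x, \mu) + q(x, \mu), \qquad P\psi(x, \mu) = \frac{1}{2}\int_{-1}^{1} \psi(x, \mu')\,d\mu' \tag{1}$$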

where $\sigma$ is the total probability of interaction of a neutron with its medium, $\sigma_s$ its
scattering probability and $q$ a medium-independent source of neutrons. The unknown
is the flux $\psi(x, \mu)$ defined for $x \in [0, L]$ and for every incident-angle
cosine $\mu$, with appropriate boundary conditions at $x = 0$ and $x = L$. Based on
the use of linear continuous spatial finite elements, K.J. Ressel [1] has developed
a Least-Squares solver where the conditions of existence/uniqueness of the solution
are handled with particular care even in the asymptotic situation (where the transport
equation becomes singular and can be approximated by a second order diffusion
equation).

In this approach, the transport equation is scaled prior to the application
of the least-squares method, by a scaling functional $S = P + \tau(I - P)$ ($\tau > 0$)
representing the ratio of the diffusion solution part over the transport solution. The
least-squares formulation of this scaled neutron equation is: find $\psi \in V$ such
that $\forall v \in V$:

$$\langle S\mathcal{L}\psi, S\mathcal{L}v \rangle = \langle Sq, S\mathcal{L}v \rangle \tag{2}$$

where the scalar product is defined by $\langle u, v \rangle = \int_0^L dx \int_{-1}^{1} d\mu'\, u(x, \mu')\,v(x, \mu')$,
and the functional $\mathcal{L}$ by $\mathcal{L}\psi = \left[\mu \frac{\partial}{\partial x} + \sigma(x) I - \sigma_s(x) P\right]\psi$.

The least-squares Eq. (2) is similar to a variational principle and is well suited
for the use of spatial finite elements. The incident-angle cosine is represented by
a spectral discretization called the $P_N$ method. The flux is then expanded in
spherical harmonics [2] as:

$$\psi(x, \mu) = \sum_{l=0}^{N} p_l(\mu)\,\phi_l(x) \tag{3}$$

where $\phi_l(x)$ are the flux moments and $p_l(\mu)$ the standard normalized Legendre
polynomials. The flux expansion is substituted into Eq. (2) and the $N + 1$ flux moments
are the actual unknowns of the new equation. The domain $[0, L]$ is split into $M$
cells. Eq. (2) now defines the $(N + 1) \times (M + 1)$ system of equations required to
solve Eq. (1).

A penalization technique is used to apply the boundary conditions, leading
to the linear system:

$$(S\mathbb{L})^T S\mathbb{L}\,\Phi + \lambda E^T E\,\Phi = (S\mathbb{L})^T S q + \lambda E^T d \tag{4}$$

where $\mathbb{L}$, $S$ and $E$ correspond to functional, scaling and boundary condition
matrices respectively, and $\lambda$ is a case-dependent constant. Parallel algorithms
and communication efforts using PVM [3] or MPI [4] will now be discussed.

2 Different Methods of Resolution 

Eq. (4) represents a sparse system written as $Ax = f$, where $A = (S\mathbb{L})^T S\mathbb{L}$ and
$f = (S\mathbb{L})^T S q$. The global matrix $A$ is symmetric and positive definite by the
underlying theory of the least-squares method. Moreover, as the finite element
method is used to represent a 1-D space variable, the global matrix remains
block tridiagonal. At each point, $N + 1$ degrees of freedom represent the flux
moments. Each block of $A$ is then at least $(N + 1) \times (N + 1)$. In fact, these
blocks are also sparse because the flux moments form a 3-point stencil prior to
the least-squares method. The global system can be expressed as:



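$$\begin{pmatrix} B_1 & C_1 & & \\ C_1^T & B_2 & \ddots & \\ & \ddots & \ddots & C_M \\ & & C_M^T & B_{M+1} \end{pmatrix} \begin{pmatrix} \Phi(1) \\ \vdots \\ \Phi(M+1) \end{pmatrix} = \begin{pmatrix} f(1) \\ \vdots \\ f(M+1) \end{pmatrix} \tag{5}$$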











The boundary conditions only affect diagonal blocks.

Direct methods can efficiently use the sparsity of $A$ and a block-tridiagonal
solution was implemented. Each diagonal block $B$ is LU-decomposed and completely
stored in an $(N + 1) \times (N + 1)$ array. This is acceptable because the number
of moments, $N$, is generally small (under 10) and much smaller than the number
of mesh points.

Standard iterative methods are also considered. First, block Jacobi iteration
$l + 1$ can be composed of $M + 1$ independent equations that can be written as:

$$B_i\,\tilde{x}^{(l+1)}(i) = f(i) - C_{i-1}^T\,x^{(l)}(i-1) - C_i\,x^{(l)}(i+1), \qquad i = 1, M+1 \tag{6}$$

The Jacobi iterations can be relaxed by applying the parameter $\omega$ as

$$x^{(l+1)} = \omega\,\tilde{x}^{(l+1)} + (1 - \omega)\,x^{(l)} \tag{7}$$

where $\tilde{x}^{(l+1)}$ is the solution of Eq. (6). This parameter is a function of the eigenvalues
of the iterative matrix, and is generally < 1.

The block successive over-relaxation (SOR) method follows almost the same
algorithm, but in Eq. (6) the unknowns that are already computed are used,
such that $x^{(l)}(i - 1)$ is replaced by $x^{(l+1)}(i - 1)$. This method converges, if it
does, faster than Jacobi. An optimal relaxation parameter ($0 < \omega < 2$) can also
be used with the SOR method. For both methods, a variational acceleration
technique is used. This technique computes a new iterate as in Eq. (7), but a
dynamic parameter $\omega = \alpha_l$ is computed at each accelerated iteration to minimize
the residual. Assuming three iterates $x_1$, $x_2$, $x_3$, the following expression for $\alpha_2$
can be used to accelerate $x_2$ and $x_3$:

$$\alpha_2 = \frac{\langle r_2, r_3 - r_2 \rangle}{\| r_3 - r_2 \|^2} \tag{8}$$

where $r_2 = x_2 - x_1$ and $r_3 = x_3 - x_2$. For stability reasons, free and accelerated
iterations are often mixed; a cycle of 3 free, 3 accelerated is generally used.
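The difference between the relaxed Jacobi and SOR sweeps can be seen on a small example. The sketch below is a simplification made for illustration, not the solver of this paper: each block is reduced to a scalar, so the system is a symmetric tridiagonal matrix with constant diagonal `diag` and off-diagonal `off`, the variational acceleration is left out, and the function name `relaxed_sweeps` is hypothetical. The only change between the two methods is whether a sweep reads the values already updated in that sweep.

```python
def relaxed_sweeps(diag, off, f, omega=1.0, use_new_values=False,
                   tol=1e-10, max_iter=10_000):
    """Relaxed Jacobi (use_new_values=False) or SOR (use_new_values=True)."""
    n = len(f)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        # Jacobi reads a frozen copy of the previous iterate; SOR reads x itself
        src = x if use_new_values else list(x)
        change = 0.0
        for i in range(n):
            left = src[i - 1] if i > 0 else 0.0
            right = src[i + 1] if i < n - 1 else 0.0
            x_tilde = (f[i] - off * (left + right)) / diag    # local solve
            new = omega * x_tilde + (1.0 - omega) * src[i]    # relaxation step
            change = max(change, abs(new - src[i]))
            x[i] = new
        if change < tol:                                      # max-norm stopping test
            return x, it
    raise RuntimeError("no convergence")
```

For `diag = 4`, `off = -1` and a constant right-hand side, the SOR variant with $\omega = 1$ reaches the tolerance in roughly half as many sweeps as Jacobi, while both return the same solution up to the tolerance.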

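As $A$ is s.p.d., the Conjugate Gradient (CG) method can also be implemented,
where iteration $k$ is, beginning with $r^0 = f - A x^0$, $M z^0 = r^0$:

$$\begin{cases} \alpha_k = \langle r^k, z^k \rangle / \langle p^k, A p^k \rangle \\ x^{k+1} = x^k + \alpha_k p^k \\ r^{k+1} = r^k - \alpha_k A p^k \\ M z^{k+1} = r^{k+1} \\ \beta_k = \langle r^{k+1}, z^{k+1} \rangle / \langle r^k, z^k \rangle \\ \text{Convergence test} \\ p^{k+1} = z^{k+1} + \beta_k p^k \end{cases} \tag{9}$$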
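The iteration above can be written compactly for a small dense matrix. The following sketch is illustrative only: it assumes a zero initial guess, takes $p^0 = z^0$ as the first search direction, uses the diagonal of $A$ as a stand-in preconditioner $M$, and stops on the $\langle r, z \rangle$ product; the names `pcg`, `matvec` and `dot` are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, f, tol=1e-12, max_iter=1000):
    """Preconditioned CG with M = diag(A); returns (solution, iterations)."""
    n = len(f)
    x = [0.0] * n
    r = list(f)                                # r0 = f - A x0 with x0 = 0
    z = [r[i] / A[i][i] for i in range(n)]     # solve M z = r
    p = list(z)                                # first search direction
    rz = dot(r, z)
    for k in range(max_iter):
        if rz < tol:                           # convergence test on <r, z>
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    raise RuntimeError("no convergence")
```

Each iteration needs one matrix-vector product and two inner products, which is where the global communication of a parallel version comes from.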



Preconditioning with $M$ is considered to accelerate convergence. A block-diagonal
preconditioner is used at first because the test cases are often heterogeneous.
In this case, the Incomplete Cholesky factorization technique
can take advantage of the sparsity of the blocks $B$ and $C$.

3 Parallel Algorithms

The direct solver is an LU decomposition followed by a triangular solve.
A parallel implementation of this direct tridiagonal solver was not done,
because standard parallel schemes would be inefficient for natural ordering. A
domain decomposition method is under development.

The present work is based on iterative methods. The block Jacobi algorithm
is directly parallel even when over-relaxed. The global matrix $A$ is partitioned
among the $p$ processors simply by assigning $k = (M + 1)/p$ block lines to each
processor and communicating the results to the neighboring processors. If $p$ is at least
equal to the number of different regions in the domain, the lines representing
the same region are kept together so that each processor sees and solves
over a homogeneous domain.

The SOR implementation reorders the unknowns to form a Red/Black scheme.
Each SOR iteration is represented in Fig. 1. The implied Jacobi steps are highly
parallel for each colored unknown. An equal number of Red and Black unknowns
are partitioned over the processors. When possible, unknowns corresponding to
the same material region are kept together.



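1. $B_R\,\tilde{x}_R = f_R - C\,x_B^{(l)}$

2. $x_R^{(l+1)} = \omega\,\tilde{x}_R + (1 - \omega)\,x_R^{(l)}$

3. $B_B\,\tilde{x}_B = f_B - C^T x_R^{(l+1)}$

4. $x_B^{(l+1)} = \omega\,\tilde{x}_B + (1 - \omega)\,x_B^{(l)}$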



Fig. 1. Red/Black SOR algorithm 
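A Red/Black sweep can be sketched on a scalar tridiagonal system, where even-numbered unknowns are taken as Red and odd-numbered ones as Black. This is an illustrative simplification (scalar blocks, constant coefficients, hypothetical name `red_black_sor`), not the block implementation of this paper. Within one colour every update depends only on unknowns of the other colour, so the updates of a colour could all be done at the same time.

```python
def red_black_sor(diag, off, f, omega=1.0, tol=1e-10, max_iter=10_000):
    """Two-colour SOR for a symmetric tridiagonal system with constant coefficients."""
    n = len(f)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        change = 0.0
        for colour in (0, 1):                  # 0: Red (even i), 1: Black (odd i)
            for i in range(colour, n, 2):      # independent updates within a colour
                left = x[i - 1] if i > 0 else 0.0
                right = x[i + 1] if i < n - 1 else 0.0
                x_tilde = (f[i] - off * (left + right)) / diag
                new = omega * x_tilde + (1.0 - omega) * x[i]
                change = max(change, abs(new - x[i]))
                x[i] = new
        if change < tol:
            return x, it
    raise RuntimeError("no convergence")
```

In a distributed version, the neighbouring values would have to be exchanged after each colour, which is why this scheme communicates more per iteration than plain Jacobi.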



The communication effort for partition over regions is limited to the two 1-D
interfaces. So each processor sends at most $N + 1$ unknowns to its left and right
neighbors. Each processor computes $k$ blocks of $N + 1$ unknowns with 2 extra
blocks for the neighboring data.

The convergence criterion is based on the $L_\infty$ norm, readily available on all
parallel processors:

$$\| x^{(l+1)} - x^{(l)} \|_\infty < \varepsilon$$

Special attention is given to selecting an $\varepsilon$ that depends on the solution which is
sought.

In the two previous methods, two inner products are needed for variational
acceleration (Eq. (8)), thus increasing the communication effort. However, such
calculations are not needed at every iteration, and the more free iterations are
added, the smaller the effect of the inner product calculations becomes.

The parallel preconditioned conjugate gradient method is also implemented.
Only partitions over regions are used. Two preconditioners are selected: block
diagonal and SSOR preconditioners. For the latter, $M = (D + L) D^{-1} (D + L^T)$.



Parallel Algorithms for the Least-Squares Finite Element Solution 






As shown in the PCG algorithm (Eq. (9)), two inner products are required
to evaluate the parameters $\alpha$ and $\beta$. The $\langle r^k, z^k \rangle$ product is used as a first
convergence criterion. If $\langle r^k, z^k \rangle < \varepsilon$, then the actual residual norm $\langle r^k, r^k \rangle$
is computed to ensure that convergence is reached. The increase in calculations
and communication is then limited to the last iterations.

Table 1 summarizes the communication between $p$ processors for the previous
methods. In terms of communication effort, standard iterative methods, block
Jacobi and block SOR, as well as preconditioned conjugate gradient need almost
the same number of exchanges and data size exchanged per iteration.



Table 1. Communication effort per iteration 







            (lost)      α       ||.||∞     Wh/P       (lost)
# comm.     p - 1       p       p          p          2 x p
# words     N + 1       4       1          4          N + 1
Methods     SOR only    All     Jac./SOR   PCG only   All



The neighboring exchange depends on the partition and not
on the iterative process. The block Jacobi method uses the least communication
whereas the block SOR uses the most because of the exchange of new neighboring
values as they are computed. It is mostly the number of iterations that decides
which method is more effective, in parallel or not.

4 Numerical Results 

Two different test cases are used to evaluate LSFEM discretization of the trans- 
port equation in a parallel environment. These test cases are: 

- Diffusive heterogeneous domain.

- Heterogeneous domain containing a void region.

In diffusive domains, the parameters of Eq. (1) have particular values: $L \gg 1/\sigma$
and $\sigma \approx \sigma_s$. The diffusive case, labeled Larsen, is a two-region dimensionless slab
using $L = 11.0$ with a fixed incoming source of 1 at $x = 0$ and a free surface at
$x = L$. The heterogeneous case, labeled Reed, is a five-region slab using $L = 8.0$
with a perfect reflector at $x = 0$ and a free surface at $x = L$. The properties of
the different regions for each case are shown in Table 2.

The direct solution is used as the reference result for every case. First, iterative
method results are compared with reference values using an $L_\infty$ norm of the
relative error. For this, the Larsen case is chosen, for which there are 2230 cells
and 16 flux moments, so a total of 35680 unknowns.

Table 2. Parameters for both test cases

Larsen                      Reed
x       σ       σ_s         x      σ      σ_s    q
1.0     2.      0.          2.0    50.0   0.0    50.0
10.0    100.    100.        1.0    50.0   0.0    0.0
                            2.0    0.0    0.0    0.0
                            1.0    1.0    0.9    1.0
                            2.0    1.0    0.9    0.0

All iterative methods give very accurate results with respect to the direct
solutions. Optimal over-relaxation parameters were obtained numerically. As
expected, the SSOR preconditioner is more effective for accelerating the CG
convergence than the block diagonal one. The Jacobi method needs 3 times
as many iterations as SOR or CG, and the latter two need a similar number
of iterations. In this Larsen case, no source is applied so the only nonzero
contribution in $f$ comes from the boundary condition at $x = 0$. This boundary
source slowly travels to one new spatial point per iteration.

Parallel calculations are performed with $p$ processors on cluster computers
composed of IBM RISC-6000 machines, using PVM 3.4, and on an IBM RS/6000 SP3
computer composed of 4 NightHawk (222 MHz Power 3) 8-processor nodes, using
MPI.

The Larsen case was first used to benchmark parallel calculations. The partition
of the geometry defines the number of unknowns for each processor. Table 3
shows the speed-up for the different methods on the IBM cluster, limited by a
10 Mbytes Ethernet communication network.

The Red/Black ordering is used for the parallel version of the block Jacobi
as well as the block SOR methods. The block Jacobi method is directly parallel
and shows a speed-up > 2 for $p = 2$ because the communication effort is
minimal while a total of 17920 unknowns are computed on each processor.



Table 3. Speed-Up for the different methods on the IBM cluster

           # of processors
           p = 2    p = 5    p = 10
Jacobi     2.41     3.99     3.00
SOR        1.70     2.18     1.65
PCG        1.90     3.58     2.41



Calculation effort is equivalent for the Jacobi and SOR methods; however, the
speed-up obtained for the latter is lower. In fact, the 1-dimensional finite element
method leads to an almost degenerate parallelism. The Red/Black ordering for
the SOR method should be more efficient in speeding up a 2- or 3-dimensional
problem.

Speed-up decreases with the number of processors as the communication
effort overtakes that required for the calculations. But when the number of
unknowns is increased by a factor of 10, one obtains 8.0 for $p = 10$ with the PCG method.

Calculations were reproduced on the IBM SP3 computer. MPI routines replace
the PVM ones. Both parallel libraries may be used, but the results are only
reported for the MPI version. In fact, this implementation allows the use of the
User Space option for inter-node communication through the SP3 Switch. For
the intra-node exchange, the MP_SHARED_MEMORY flag is set to yes. Table 4
summarizes the results for this computer; speed-up is evaluated in reference
to the 2-processor calculations.



Table 4. Speed-Up on the IBM SP3 computer

           # of processors
           p = 2    p = 4    p = 8    p = 16
Jacobi     1.00     1.66     3.33     6.00
PCG        1.00     1.69     3.33     6.10



Efforts have been made to maintain the number of inter-node calculations, so
the 2-processor and 4-processor computations use 1 processor per node. But the
User Space version is limited to 4 processors per node, so a total of 16 processors.
The Red/Black SOR method results in inadequate speed-up due to the 1-D
degenerate state. The speed-up is at most 6 against an optimal value of
8. Even though no dedicated session was available, the speed-ups are quite
acceptable.
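Because the speed-ups of Table 4 are measured against the 2-processor run, the ideal value for $p$ processors is $p/2$, which gives the optimal value of 8 for 16 processors. The short sketch below makes this arithmetic explicit; the table values are copied as given, and the name `relative_efficiency` is only illustrative.

```python
SPEEDUP_VS_2 = {  # copied from Table 4; the reference run uses 2 processors
    "Jacobi": {2: 1.00, 4: 1.66, 8: 3.33, 16: 6.00},
    "PCG": {2: 1.00, 4: 1.69, 8: 3.33, 16: 6.10},
}

def relative_efficiency(method, p, p_ref=2):
    """Measured speed-up divided by the ideal speed-up p / p_ref."""
    ideal = p / p_ref
    return SPEEDUP_VS_2[method][p] / ideal
```

For the Jacobi method on 16 processors this gives 6.00 / 8 = 0.75, and the efficiency stays between about 0.75 and 0.85 for all multi-processor entries of the table.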

The Reed case was used to look into the code scalability. Each processor, from
2 to 16, treats 232 cells for 16 flux moments, so 3712 unknowns. The execution
time as a function of the number of processors is flat from 2 to 8 processors.
Above that, as the load average of each node was over 8.0 during the tests, the
results are not significant.

5 Conclusion 

The LSFEM is successfully applied to solve the neutron transport equation. The
SSOR preconditioned CG method is the most effective iterative method we tested
in a parallel context. Both PVM and MPI versions of the code provide similar
results on an IBM cluster computer. The speed-up on the IBM SP3 computer is
attractive for large problems, provided the User Space option is used.

Partitions over cells may be used in the future to implement a domain decomposition
algorithm coupled with this PCG method. In the 3D implementation,
the global matrix will become 7-block diagonal for a regular mesh. The number
of moments will increase, but the blocks will still be sparse. That system may
be solved using the Alternated Direction Sweeping (ADS) method. For each
direction sweep, the PCG algorithm developed herein can be used.

Abstract. The Genoa Active Message MAchine (GAMMA) is a lightweight
communication system based on the Active Ports paradigm, originally
designed for efficient implementation over low-cost Fast Ethernet
interconnects. A very efficient port of MPICH atop GAMMA
has been recently completed, providing unprecedented messaging performance
over the cheapest cluster computing technology currently available.
In this paper we describe the recently completed port of GAMMA
to the GNIC-II Gigabit Ethernet adapters by Packet Engines. A combination
of less than 10 μs latency and more than 93 MByte/s throughput
demonstrates the possibility for Gigabit Ethernet and GAMMA to yield
messaging performance comparable to that of many lightweight
protocols running on Myrinet. This result is of interest, given the envisaged
drop in cost of Gigabit Ethernet due to the forthcoming transition
from fiber optic to UTP cabling and the ever increasing mass market production
of this standard interconnect.



1 Introduction 

The low-cost processing power of clusters of Personal Computers (PCs) can
be easily exploited by means of appropriate, high-level, standard Application
Programming Interfaces such as MPI. Several open-source implementations
of MPI (e.g., MPICH) can run on Linux-based Beowulf-type clusters on
top of standard TCP/IP sockets. However, it is well known that parallel jobs
characterized by medium/fine grain parallelism exchange messages of small size
(a few KBytes) among processors, and that general-purpose protocols such as
TCP/IP make very inefficient use of the interconnect for short messages.

On Fast Ethernet, a lightweight messaging system like the Genoa Active Message
MAchine (GAMMA) provides far better performance at no additional
hardware cost compared to the Linux TCP/IP stack. Started in 1996 as part
of a Ph.D. thesis, the GAMMA project was targeted right from the beginning
at low-cost Fast Ethernet interconnects. GAMMA implements a non-standard
communication abstraction called Active Ports, derived from Active Messages,
and provides best-effort as well as flow-controlled communication routines. With
an end-to-end latency between 12 and 20 μs (depending on the hardware configuration
and the NIC used), GAMMA provides adequate communication and
synchronization primitives for tightly coupled, fine-grain parallel applications
on inexpensive clusters. A complete yet efficient implementation of MPI atop
GAMMA is now available.

The link speed offered by Fast Ethernet is insufficient for many communication-intensive
parallel jobs to scale up. This justifies the interest in the two
best-known gigabit-per-second interconnects, namely Myrinet and the more recent
Gigabit Ethernet. The inefficiency of TCP/IP is exacerbated here, since
the physical transmission time becomes negligible compared to the time spent
in the traversal of the protocol stack. Lightweight messaging systems become a
key ingredient for an effective use of fast interconnects.

Currently, the per-port cost of a Gigabit Ethernet LAN is too high compared
to Myrinet. This is largely due to the current fiber-optic cabling of Gigabit Ethernet.
With the forthcoming transition of Gigabit Ethernet from fiber-optic to
standard UTP copper cabling, however, we expect a substantial drop in cost
which will push this LAN technology into a much larger segment of the marketplace,
characterized by competition among different vendors and a potentially
very large installed base; eventually, the per-port cost of Gigabit Ethernet
will become negligible compared to the cost of a single PC, in much the same
way as occurred with Fast Ethernet. In our opinion, Myrinet will never enjoy
such wide diffusion: its market segment (system-area networks) will
remain narrow compared to LANs.

Moreover, some Gigabit Ethernet NICs are programmable in much the same
way as Myrinet is. For instance, the NetGear GA620 NIC, a cheap (300 US dollars)
clone of the Alteon AceNIC adapter, comes with as many as two on-board
microprocessors and 512 KBytes of on-board RAM, which makes it possible to
upload appropriate, possibly self-made custom firmware to the NIC.

The only remaining difference between Myrinet and Gigabit Ethernet is the
reliability of the physical medium, especially in case of network congestion, which
may cause packet losses. Myrinet prevents congestion by using hardware mechanisms
for back-pressure, whose practical effectiveness has been demonstrated.
Gigabit Ethernet uses hardware-exchanged control packets to block senders in
case of congestion hazard, according to the IEEE 802.3x specification; this should
in principle avoid congestion, and the resulting packet losses, although nobody has
been able to assess the effectiveness of this mechanism so far. Another difference,
though of secondary concern, is the higher communication latency of current Gigabit
Ethernet switches (in the order of 3 to 4 μs), compared to the very low latency of a
Myrinet switch (less than 1 μs).

To sum up, in our opinion Gigabit Ethernet is a promising and
cheaper alternative to Myrinet in every respect, and an efficient lightweight
protocol is definitely a must to make best use of this technology.

In order to prepare for the transition to inexpensive Gigabit Ethernet, we
started developing a prototype of GAMMA for the Packet Engines GNIC-II Gigabit
Ethernet adapter. The prototype was ready in September 1999; although
Packet Engines discontinued the NIC a few months later, that first prototype
of GAMMA was indeed useful to experimentally demonstrate the feasibility
and success of a lightweight communication protocol like GAMMA on next-generation
inexpensive LANs. Overall, MPI/GAMMA on our GNIC-II Gigabit
Ethernet NICs yields excellent performance figures. On two Pentium II 450 PCs
networked by a Gigabit Ethernet switch, an MPI user application enjoys 16 μs
end-to-end latency and 93.5 MByte/s peak throughput (77 % of the nominal link
speed), comparable to if not better than many lightweight messaging systems
running on Myrinet.



2 The DBDMA Data Transfer Mode 

A NIC is an interface between a host CPU and a network; as such, each NIC
must implement suitable mechanisms to cooperate with the host computer, on
one hand, and the network, on the other hand. A modern NIC cooperates with
the host computer using a data transfer mode called Descriptor-based DMA
(DBDMA). With the DBDMA mode, the NIC is able to autonomously set up
and start DMA data transfers. To do so, the NIC scans two precomputed and
static circular lists called rings, one for transmit and one for receive, both stored
in host memory. Each entry of a ring is called a DMA descriptor.

A DMA descriptor in the transmit ring contains a pointer (a physical address)
to a host memory region containing a fragment of an outgoing packet; therefore,
an entire packet can be specified by chaining one or more send DMA descriptors,
a feature called “gather”.

Similar to a descriptor in the transmit ring, a DMA descriptor in the receive
ring contains a pointer (a physical address, again) to a host memory region where
an incoming packet could be stored. The analogue of the “gather” feature of
the transmit ring is here called “scatter”: several descriptors can be chained to
specify a sequence of distinct memory areas, and an incoming packet could be
scattered among them.

A NIC operating in DBDMA mode allows a greater degree of parallelism
in the communication path, according to a producer/consumer behaviour. At
the transmit side, while the NIC “consumes” DMA descriptors from the transmit
ring, performing the host-to-NIC data transfers specified in the descriptors,
the CPU runs the protocol stack and “produces” the necessary DMA descriptors
for subsequent data transfers. The reverse occurs at the receive side. Since
both sides are decoupled from each other, the communication path works like
a pipeline whenever traversed by a sequence of data packets, with a potentially
high throughput.
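The producer/consumer behaviour of a descriptor ring can be modelled in software as a fixed-size circular buffer. The sketch below is a simplified model, not GAMMA or driver code: real descriptors hold a physical address and a length and are scanned by the NIC hardware, whereas here a descriptor is any Python object and the class and method names (`DescriptorRing`, `produce`, `consume`) are hypothetical.

```python
class DescriptorRing:
    """Fixed-size circular list of descriptors shared by a producer and a consumer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0       # next slot the producer fills
        self.tail = 0       # next slot the consumer reads
        self.count = 0      # descriptors currently pending

    def produce(self, descriptor):
        if self.count == len(self.slots):
            return False                       # ring full: the producer has to wait
        self.slots[self.head] = descriptor
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return True

    def consume(self):
        if self.count == 0:
            return None                        # ring empty: nothing to transfer
        descriptor = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return descriptor
```

On the transmit side the CPU would call `produce` and the NIC would call `consume`; on the receive side the roles are swapped. As long as the ring is neither full nor empty, the two sides proceed independently, which is what gives the pipeline effect.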

However, the Linux implementation of IP forces the NIC device drivers not
to exploit the “gather/scatter” features. This implies that at least two temporary
copies of data are needed, one at the sender and the other at the receiver
side, because header and payload of each packet are expected to be contiguous
in the same memory area at both sides of communication when control reaches
the device driver. A different organization and semantics of the communication
protocol could eliminate the memory copy at the sender side by exploiting the
“gather” feature: the header could be precomputed and stored somewhere in kernel
space, the payload could be pointed to directly in user space, and the NIC
would autonomously arrange them contiguously in its on-board transmit FIFO
and send the whole packet. However, avoiding the memory copy at the receiver
side is impossible: the final destination in user space for the payload of an incoming
packet can be determined only after inspecting the header, which implies the
packet is already stored somewhere, namely, in a temporary buffer. The only
way to avoid a memory copy at the receiver side is to run the communication
protocol on the NIC itself, and let it inspect the headers while packets are still
in its on-board receive FIFO.

To sum up, a lightweight protocol for non-programmable NICs can be zero-copy
on send thanks to the “gather” feature, but it must be “one-copy” on
receive. Indeed, all the latest prototypes of GAMMA adopted this very organization,
where most of the send overhead is borne by the NIC and most of the receive
overhead is borne by the host CPU.

3 Maximizing the End-to-End Throughput 

The theoretical maximum throughput with Gigabit Ethernet is 125 MByte/s,
roughly equivalent to the maximum throughput of the 32 bit, 33 MHz PCI
bus where the NICs are plugged in. However, a first prototype of GAMMA on
Packet Engines GNIC-II adapters yielded a peak end-to-end throughput of only
80 MByte/s. We immediately started investigating the causes of such low
efficiency, using a pair of Pentium II 450 MHz PCs with a 100 MHz system bus.

To begin with, we identified the following consecutive stages in the commu- 
nication pipeline, from sender to receiver, in our opinion accounting for most of 
the total communication effort: 

— stage 1: the “consumption” of DMA descriptors from the transmit ring and 
the corresponding DMA data transfers from the host RAM to the NIC, 
operated by the sender NIC; 

— stage 2: the physical link; 

— stage 3: the DMA transfers from the NIC to the host RAM and the cor- 
responding “production” of descriptors in the receive ring, operated by the 
receiver NIC; 

— stage 4: the “consumption” of DMA descriptors from the receive ring and
the related protocol action and data movements, carried out by the receiver 
CPU. 

Measuring the throughput of all the above pipe stages is not necessarily an 
easy task. Throughput of stage 1 can be measured directly, and throughput of 
stage 2 is known in advance (125 MByte/s). However, throughput of stage 3 
cannot be easily measured by a direct technique. Moreover, we have to evaluate 
the effect of bus contention between stages 3 and 4 when both try to access the 
host RAM. Finally, we do not know in advance where the bottleneck is, and the 
presence of a bottleneck invalidates any direct throughput measurement taken 
below it in the pipeline.

The throughput of stage 1, directly measured, was 80.8 MByte/s, very close
to the end-to-end throughput. This meant that stage 1 was the bottleneck, and 
also that we could not know much more about stages 3 and 4. 

To proceed with our analysis we had to eliminate this bottleneck, even sacri- 
ficing the protocol correctness. We suspected that such a slow speed was caused 
by the “gather” feature: the sender NIC had to “consume” as many as two 
DMA descriptors for each outgoing packet. In order to prove this, we temporar- 
ily switched to a different transmission technique. The sender CPU “produces” 
only one descriptor for each packet, pointing to a dummy packet of appropriate 
size but containing only the header, without user data. This way the sender trans- 
mits wrong data, but this does not hurt performance measurements. We then 
observed a much higher throughput of 98 MByte/s for stage 1. The peak end-to-end
throughput did not increase as much, though: it only measured 93.8 MByte/s,
indicating that a second bottleneck was set up somewhere else. Throughput of 
stage 4, as directly measured, was about 100 MByte/s, therefore stage 3 was the 
bottleneck with its 93.8 MByte/s. By temporarily disabling the data movements 
in stage 4 we were able to isolate the performance degradation due to the con- 
tention between stages 3 and 4 on the memory bus; indeed, disabling stage 4 led 
to an increase of the stage 3 throughput from 93.8 to 96 MByte/s. 

The lesson is now clear: if we want to maximize the end-to-end GAMMA 
throughput with the GNIC-II Gigabit Ethernet NICs, we should not use the 
“gather” feature of the NIC at the sender side. This however forces us to add 
another stage in the communication pipeline just where it begins, namely: 

— stage 0: the copy of the packet payload from the user buffer to a preallo- 
cated memory buffer already containing the precomputed header, and the 
corresponding “production” of one single DMA descriptor into the transmit 
ring, carried out by the sender CPU. 

Indeed we have implemented such an additional pipe stage, which exhibits 
a throughput of 95 MByte/s. Adding such a temporary copy on send was the 
paradoxical price we had to pay in order to maximize the end-to-end throughput 
of GAMMA with our Gigabit Ethernet adapter. 
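
The reasoning of this section amounts to treating the communication path as a pipeline whose end-to-end throughput is bounded by its slowest stage. The sketch below replays that reasoning with the figures quoted above. The function name `bottleneck` is ours, and reusing the stage 3 and stage 4 figures for the "gather" configuration is an assumption, since they were only measured after the first bottleneck had been removed.

```python
def bottleneck(stages: dict) -> tuple:
    """Return (stage name, throughput in MByte/s) of the slowest pipeline stage."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Sender using the "gather" feature (two DMA descriptors per packet).
# Stage 3 and stage 4 values are assumed equal to those measured later.
with_gather = {
    "stage 1": 80.8,
    "stage 2": 125.0,
    "stage 3": 93.8,
    "stage 4": 100.0,
}

# Sender copying the payload next to a precomputed header (stage 0 added).
without_gather = {
    "stage 0": 95.0,
    "stage 1": 98.0,
    "stage 2": 125.0,
    "stage 3": 93.8,
    "stage 4": 100.0,
}

slow_before = bottleneck(with_gather)
slow_after = bottleneck(without_gather)
```

Under these figures the bottleneck moves from stage 1 to stage 3, and the extra copy of stage 0 stays above the new bottleneck, which is why it does not reduce the end-to-end throughput.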



4 Improving Throughput for Short Messages 

Fragmenting a message into packets is necessary when the total message size exceeds
the maximum MTU size of the network. Fragmentation increases the CPU over- 
head for header processing, and leads to a lower utilization rate of the physical 
link. However, since the end-to-end communication path is a pipeline, fragmen- 
tation also leads to a much better throughput, by exploiting the parallelism 
among pipe stages. For this reason, message fragmentation has been exploited 
also with Myrinet, which is not packet oriented. Even with Fast Ethernet, it can 
be convenient to fragment a message even when it is smaller than the maximum 
allowed MTU size 0 .

To improve the efficiency with short messages, GAMMA can fragment each
short message into packets whose size is not necessarily maximal. Of course, this 
increases the number of packets exchanged. The optimal fragment size depends 
on the total message size and also depends on the performance profile of the 
communication hardware. Due to the lack of a satisfactory performance model 
of the whole communication system, we could only find an empirical formula
for optimal fragmentation, valid only for the GNIC-II adapters: messages up to 
512 bytes are fragmented in packets of 128 bytes, messages from 513 to 1664 
bytes are fragmented in packets of 256 bytes, messages from 1665 to 5148 bytes 
are fragmented in packets of 384 bytes, messages from 5149 to 11000 bytes 
are fragmented in packets of 768 bytes, messages from 11001 to 12000 bytes 
are fragmented in packets of 896 bytes, and larger messages are fragmented in 
packets of 1408 bytes. 
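
The empirical rule above can be written as a lookup table. The thresholds and packet sizes are those listed in the text; the names `packet_size` and `packet_count` are illustrative. The packet count additionally assumes that each packet size includes the 20-byte GAMMA header, which the text does not state explicitly for the smaller packet sizes.

```python
import math

# (largest message size in bytes, packet size in bytes), as listed in the text.
FRAGMENT_TABLE = [
    (512, 128),
    (1664, 256),
    (5148, 384),
    (11000, 768),
    (12000, 896),
]
MAX_PACKET = 1408  # packet size used for all larger messages


def packet_size(message_size: int) -> int:
    """Packet size chosen by the empirical rule for a given message size."""
    for limit, size in FRAGMENT_TABLE:
        if message_size <= limit:
            return size
    return MAX_PACKET


def packet_count(message_size: int, header_bytes: int = 20) -> int:
    """Number of packets, ASSUMING every packet carries a 20-byte header."""
    payload = packet_size(message_size) - header_bytes
    return math.ceil(message_size / payload)
```

For example, `packet_size(1388)` selects 256-byte packets, so a message that would fit in a single maximum-size packet is deliberately split into several smaller ones to exploit the pipeline.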



5 Communication Performance 

The measurement platform was a pair of PCs, each with a single Pentium II 450
MHz CPU, 100 MHz system bus, and a Packet Engines GNIC-II NIC. The two
machines were connected back-to-back by a fiber-optic cable.

Self-made “ping-pong” microbenchmarks have been run to measure the average
end-to-end communication latency of GAMMA, MPI/GAMMA, Linux
2.2.13 TCP/IP sockets, and MPICH atop TCP/IP sockets. The measured one-way
latency is 9.5 µs with GAMMA, 12.3 µs with MPI/GAMMA, 132.1 µs with
TCP/IP sockets, and 320.9 µs with MPICH. Adding an Intel Express Gigabit
Switch in between the two machines increases the latency by 3.6 µs.

By the same “ping-pong” programs we can estimate the end-to-end through- 
put (not to be confused with the transmission throughput as measured by the 
usual one-way “stream” tests). Figure 1 reports the throughput curves of the
following messaging systems: GAMMA using the optimal fragmentation scheme
(curve 1), GAMMA with standard fragmentation (curve 3), MPI/GAMMA,
based on the optimal fragmentation (curve 2), Linux 2.2.13 TCP/IP sockets
(curve 4), and MPICH atop TCP/IP sockets (curve 5). The maximum MTU size
with GAMMA is 1408 bytes (1388 bytes of payload plus 20 bytes of GAMMA 
header), which experimentally provided better performance on our platform 
(possibly due to a better cache line alignment). 
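
How a "ping-pong" test yields both figures can be sketched as follows. This is not the benchmark used for the measurements above: it runs over a local socket pair inside one process, so its numbers say nothing about Gigabit Ethernet. It only shows the method, namely that the one-way latency is half of the average round-trip time and that the end-to-end throughput is the message size divided by the one-way time. The names `recv_exact` and `ping_pong` are illustrative.

```python
import socket
import threading
import time


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Receive exactly `size` bytes from a stream socket."""
    data = bytearray()
    while len(data) < size:
        part = sock.recv(size - len(data))
        if not part:
            raise ConnectionError("peer closed the connection")
        data.extend(part)
    return bytes(data)


def ping_pong(message_size: int, iterations: int = 200) -> tuple:
    """Return (one-way latency in seconds, end-to-end throughput in byte/s)."""
    ping, pong = socket.socketpair()
    message = b"x" * message_size

    def echo() -> None:
        # The "pong" side sends every message straight back.
        for _ in range(iterations):
            pong.sendall(recv_exact(pong, message_size))

    worker = threading.Thread(target=echo)
    worker.start()
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(message)
        recv_exact(ping, message_size)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()

    one_way = elapsed / (2 * iterations)   # half of the average round trip
    return one_way, message_size / one_way
```

Because each message waits for the previous reply, this estimate includes the full latency of every exchange, which is why it differs from the transmission throughput reported by one-way "stream" tests.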

We briefly sketch the main conclusions that can be drawn from the latency 
numbers and the throughput picture: 

— The communication performance of MPI/GAMMA on a Gigabit Ethernet
LAN based on Packet Engines GNIC-II adapters is comparable to, if not
better than, many lightweight implementations of MPI running on Myrinet,
even taking into account the additional latency of a switch.

— The efficiency with short messages of Linux TCP/IP, and MPICH thereof,
remains too low for running a tightly coupled, fine-grain parallel job on a 
commodity cluster.

— The performance improvement yielded by the optimal fragmentation technique
with GAMMA short messages, as shown by comparing curve 1 against
curve 3, is significant (for instance, a 64% improvement in throughput is
obtained with 1388 byte messages).

— The performance degradation caused by stacking MPI atop GAMMA, as
shown by comparing curve 1 against curve 2, is very modest; the same does
not hold with MPI atop TCP/IP sockets (curve 5 is much lower than curve
4). This is obvious, given that the porting of MPICH atop GAMMA was
made at the ADI level. The resulting MPI/GAMMA stack is thin compared
to the standard MPICH/P4/TCP stack.

— Linux TCP/IP is not able to saturate Gigabit Ethernet: its peak throughput
is only 42.4 MByte/s. The scenario is completely different with Fast Ethernet,
where Linux TCP/IP is indeed able to almost completely saturate
the physical link, thanks to recent improvements with device drivers and 
protocol. 

— The maximum end-to-end throughput achieved by GAMMA is 93.8 MByte/s,
and amounts to the measured throughput of the NIC on receive (stage 3 in
the communication pipeline, see Section 3). This means that GAMMA is
able to saturate the NIC, and that the NIC itself is the communication bottleneck.
However, the obtained throughput is a very respectable result, given
that the GNIC-II adapter does not support frames larger than 1526 bytes,
which is a very small size given the high link speed (the Alteon AceNIC and
its clones yield a slightly better throughput but only using a larger MTU
size, a non-standard feature called “jumbo frames”).



6 Related Work 

Most existing efficient implementations of MPI for clusters presently run on 
Myrinet (e.g., lailES). To the best of our knowledge, the only attempt at providing
an efficient implementation of MPI for Fast Ethernet and Gigabit Ethernet,
besides ours, is represented by MVICH j2|, the VIA-based implementation of
MPI developed in the framework of the M-VIA project JQ. However, little information
is available yet on MVICH; the only available performance information is
related to MVICH atop Giganet cLAN, an expensive “VIA inside” gigabit-per-second
interconnect; MVICH on Giganet yields 13.5 µs latency and 97 MByte/s
peak throughput, that is, just slightly better compared to MPI/GAMMA on
Packet Engines GNIC-II adapters.

Fig. 1. End-to-end throughput of GAMMA, MPI/GAMMA, Linux 2.2.13
TCP/IP, and MPICH atop TCP/IP, using the GNIC-II Gigabit Ethernet
adapters.

Abstract. Checkpointing techniques have widely been studied in the literature
as a way to recover from failures in sequential, distributed and parallel envi- 
ronments. However, most of the checkpointing mechanisms proposed so far fo- 
cus only on the recovery of the application data. If the application performs 
some I/O operations to disk files, such schemes may not work correctly, as they 
do not provide rollback-recovery for the file contents. 

In this paper, we present a distributed checkpointing mechanism for a Parallel 
File System that can be integrated with any of the previous application check- 
pointing algorithms. Three different file checkpointing schemes will be pre- 
sented, tested in that mechanism and discussed in detail. The distributed 
mechanism proposed was integrated in PIOUS - a public-domain parallel file 
system developed for the PVM distributed computing environment. 



Keywords: Fault-Tolerance and Reliability, Checkpointing, File checkpoint- 
ing, Parallel I/O, Extensions and improvements to PVM. 



1. Introduction 

In the past decade there has been a considerable effort to develop, provide and sup- 
port high performance computing systems for solving Grand-Challenge problems. 
These include climate predictions and control, air and water pollution, combustion 
dynamics and high-performance aircraft simulation, amongst others [8]. To support 
intensive computing for different scientific applications several parallel and distributed
systems with large numbers of processors have been developed.

One of the major problems in parallel/distributed systems is the MTBF (Mean Time 
Between Failures), which tends to decrease significantly with the number of processors.
A large percentage of complex scientific codes may take several hours, days or
even weeks to execute their tasks, so their performance may be strongly affected by
the MTBF of the system.

Some problems arise when the system has to be reset due to a crash in the application,
or just for maintenance purposes. The application must be restarted from the very
beginning, which can be very costly and does not ensure forward progress.

For long-running scientific applications, it is desirable to have the ability to save the 
state of the computation in order to continue it from that point at a later time. Check- 
pointing is the solution for this problem: it allows applications to save their state to 
stable storage at regular intervals. If a failure occurs then the application is restarted 
from that previous point, without unduly retarding its progress. 

Several checkpointing mechanisms have been proposed in the literature [1-6]. How- 
ever, most of these mechanisms focus only on the recovery of the application data. 
Most of the scientific computation deals with enormous quantities of data, which 
exists not only in main memory but also in disk files. This means that checkpointing 
mechanisms should also include the state of the files, otherwise some inconsistencies 
may occur in the recovering operation that is performed after the failure. 

The checkpointing mechanism proposed in [6] tried to solve part of the problem by 
saving the position of the file pointer (fp) and the size of the file at each checkpoint 
operation. It may work for some situations, like when read-only files or write-only 
files (with write-append operations) are in use. However it is not effective to recover 
from operations (like write, delete or trunc) that update the file contents. So this 
mechanism represents only a partial solution to the problem. 
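
The limitation of saving only the file pointer and the file size can be made concrete with a small experiment. The sketch below is our own illustration of the idea described for [6], not code from that work; the helper names (`checkpoint`, `rollback`, `run`, `append`, `overwrite`) are hypothetical.

```python
import os
import tempfile


def checkpoint(f):
    """Save the file pointer and the file size, as in the partial scheme."""
    f.flush()
    return f.tell(), os.fstat(f.fileno()).st_size


def rollback(f, saved):
    """Restore the saved size and file pointer; contents are NOT restored."""
    position, size = saved
    f.flush()
    f.truncate(size)    # discards whatever was appended after the checkpoint
    f.seek(position)


def run(update) -> bytes:
    """Write b'AAAA', checkpoint, apply `update`, roll back, return contents."""
    with tempfile.TemporaryFile() as f:
        f.write(b"AAAA")
        saved = checkpoint(f)
        update(f)
        rollback(f, saved)
        f.seek(0)
        return f.read()


def append(f):
    f.write(b"BBBB")    # write-append: undone by the truncation


def overwrite(f):
    f.seek(0)
    f.write(b"ZZ")      # in-place update: survives the rollback
```

With these helpers, `run(append)` returns the checkpointed contents, whereas `run(overwrite)` returns a file whose first bytes still carry the update made after the checkpoint.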

First we considered the use of atomic transactions [15] in file access operations,
with one transaction covering all file operations done between two checkpoints.
However, the use of transactions would considerably degrade the performance of
scientific parallel applications because it can kill the concurrency between processes.
The solution to assure the consistency of files after a recovery operation is therefore to
extend data checkpointing mechanisms to files. So, it is necessary to develop file
checkpointing mechanisms and integrate them with previous checkpointing schemes.
The main topic of this paper is the study of file checkpointing schemes and their integration
in a proposed distributed checkpointing mechanism for a parallel file system.
It was integrated in PIOUS [7], a public-domain parallel file system developed for the
PVM [11] distributed computing environment.

The rest of the paper is organized as follows: section 2 describes three file check- 
pointing schemes. Section 3 presents a distributed checkpoint mechanism for a Paral- 
lel File System. A performance study is presented in Section 4. A comparison with 
related work is done in Section 5. Finally, section 6 concludes the paper. 



2. File checkpointing schemes 

In this section, we analyze in more detail the necessary procedures for checkpointing 
application files. Basically, the items that should be saved in a checkpoint operation 
are the position of the file pointer, the size and the contents of the file and some of the 
file attributes. We developed three file checkpointing schemes:

Shadow - It duplicates the files at each checkpoint. The original copy can be modified
in any way by the application. If a failure occurs, the system has only to restore
the contents of the files by replacing the original ones with the corresponding shadow
files. This version can be very inefficient for most of the cases.

Log Save - It saves in a log file the contents that the files had at the previous checkpoint
(only for the parts that were updated by the application). If a failure occurs, in the rollback
operation the application has to recover the previous checkpoint, and the files are restored from the
log file.

Log Write - It writes in a log file the blocks written by the application in files since
the previous checkpoint. The physical files are only modified in checkpoint operations.
The write-append operations are done directly in files. Some read operations are redirected
to the blocks stored in the log. If a failure occurs, it is quite simple to recover the
files, as the contents stored in the last checkpoint were kept unmodified.
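
As an illustration of the last scheme, the following in-memory model keeps a file as a list of fixed-size blocks and a log of the blocks written since the previous checkpoint. It is a minimal sketch under simplifying assumptions: the class name `LogWriteFile` and its methods are ours, blocks are addressed by index, and write-append operations (which the scheme applies directly to the file) are left out.

```python
class LogWriteFile:
    """In-memory model of the Log Write scheme (a log of newly written blocks)."""

    def __init__(self, contents: bytes, block_size: int = 4):
        self.block_size = block_size
        # The "physical file": only modified at checkpoint time.
        self.blocks = [
            bytearray(contents[i:i + block_size])
            for i in range(0, len(contents), block_size)
        ]
        # Block index -> contents written since the previous checkpoint.
        self.log = {}

    def write_block(self, index: int, data: bytes) -> None:
        self.log[index] = bytearray(data)    # the physical file stays untouched

    def read_block(self, index: int) -> bytes:
        # Reads of updated blocks are redirected to the log.
        return bytes(self.log.get(index, self.blocks[index]))

    def checkpoint(self) -> None:
        # Apply the logged blocks to the physical file, then empty the log.
        for index, data in self.log.items():
            self.blocks[index] = data
        self.log.clear()

    def rollback(self) -> None:
        # The physical file still holds the last checkpoint: drop the log.
        self.log.clear()
```

In this model a rollback costs nothing more than discarding the log, while the price is paid at checkpoint time, when the logged blocks are applied to the file.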

Considering the mode of operation of the file we have introduced some optimizations 
in schemes, namely in read-only and in write-only files. A comprehensive description 
of these schemes was presented in [10] and [14]. As described, they can be integrated 
in any of the previous application data checkpointing schemes. In this situation, to 
assure the atomicity of the checkpoint operation, a simple two-phase commit protocol 
should be applied; otherwise, it is possible that a failure occurring during checkpoint 
time can cause inconsistency between the data checkpoint and the file checkpoint. 



3. Distributed checkpointing mechanism 

One of the major bottlenecks that affects parallel scientific codes is I/O. When I/O
operations are performed exclusively on one disk, that disk becomes a bottleneck, which may
cause a significant degradation in the application’s execution time. Parallel File Systems
(Parallel I/O) represent the solution for faster I/O, multiplying the number of physical
devices over which a file is stored and therefore increasing the I/O bandwidth. In a
parallel I/O system we can make simultaneous accesses to different blocks of a file.
Most Parallel File Systems (Parallel I/O) still lack support for fault tolerance,
namely the assurance of the consistency of the file contents after a failure.

Since a Parallel File System splits each file in segments and stores them in multiple 
I/O servers that reside in different machines, it is necessary to have a distributed file 
checkpointing mechanism. It must assure that each I/O server has a local file check- 
point support and that all local checkpoints form a coherent distributed checkpoint. 

To assure the local file checkpoint in each I/O server we can use one of the three file 
checkpointing schemes (Shadow, Log Save or Log Write) proposed in Section 2. 
Therefore, shadow or log files are distributed over several I/O servers. 

To avoid inconsistencies, and also to assure the failure atomicity on the distributed 
checkpoint operation, we propose a coordinated two-phase file checkpointing mecha- 
nism. The coordinator of the operation must be a file system process that keeps the 
information about the parallel files. The checkpoint operation can only start after all 
previous I/O operations are concluded.

Figure 1: Distributed checkpointing for a Parallel File System

The distributed file checkpointing mechanism that we propose for a Parallel File
System is represented in Figure 1. It includes an application process (P0) and a parallel
file system formed by a coordinator (IOC) and two I/O servers (IOS1 and IOS2).
An application process (P0) requires a checkpoint operation, which is formed by a
data checkpoint (DATA_CKPT) and a file checkpoint. When the I/O coordinator
receives a message (CKPT_FILES) the two-phase file checkpoint operation is started
(point S). In the first phase, to prepare the checkpoint, the coordinator sends a multicast
message (do_ckpt(N)) to all I/O servers. They store (in stable memory) the attributes
of segment files and then send a reply (ios_ckpt_done(N)) to the coordinator.
After receiving all the replies and a message announcing the successful conclusion
of the data checkpoint (point P), the first phase of the file checkpoint is complete.

In the second phase every I/O server will execute its local checkpoint, using one of
the three file checkpoint schemes (Shadow, Log Save or Log Write). Then a message
(ios_commit_done(N)) is sent to the coordinator announcing that the operation is
completed. After the successful reply of all I/O servers (point C), the coordinator
completes the file checkpoint protocol and the whole checkpoint operation.

If, during the first phase, a failure affects an I/O server, the entire operation is 
aborted. A failure during phase two causes no harm, because the operation will be 
concluded after the server restart. So, we can conclude that the proposed mechanism 
guarantees the atomicity of the checkpoint operation.
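
The two phases and the failure rules can be summarized in a small simulation. This is a sketch of the protocol as described above, not the PIOUS implementation: messages are modelled as method calls and return values, the class `IOServer` and the function `file_checkpoint` are hypothetical names, and a failed server is modelled simply as one that does not reply.

```python
class IOServer:
    """Model of an I/O server taking part in the two-phase file checkpoint."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.prepared = None     # checkpoint number whose attributes are stored
        self.committed = None    # last checkpoint number executed locally

    def do_ckpt(self, n: int):
        """Phase 1: store the segment attributes in stable memory and reply."""
        if not self.alive:
            return None
        self.prepared = n
        return ("ios_ckpt_done", n)

    def commit(self, n: int):
        """Phase 2: run the local scheme (Shadow, Log Save or Log Write)."""
        if not self.alive:
            return None
        self.committed = n
        return ("ios_commit_done", n)


def file_checkpoint(servers, n: int, data_ckpt_ok: bool = True) -> str:
    """Coordinator side; returns 'committed', 'pending' or 'aborted'."""
    replies = [server.do_ckpt(n) for server in servers]
    if not data_ckpt_ok or any(r != ("ios_ckpt_done", n) for r in replies):
        return "aborted"         # a failure in phase 1 aborts the operation
    # Point P reached: phase 2 starts and can no longer be aborted.
    done = [server.commit(n) for server in servers]
    if all(d == ("ios_commit_done", n) for d in done):
        return "committed"       # point C
    return "pending"             # concluded after the failed server restarts
```

For example, with two live servers `file_checkpoint(servers, 1)` returns `'committed'`; if one server fails before replying to the next `do_ckpt`, the result is `'aborted'` and every server keeps checkpoint 1 as its last committed state.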

To reduce the overhead introduced by the checkpoint, the mechanism only suspends 
the application during the first phase of the checkpoint (just to point A), running 
concurrently with the second phase of the file checkpoint (from point A to point B). 
We implemented the distributed checkpointing mechanism in PIOUS [7]. For the
sake of portability, we did not use any particular feature of PIOUS unavailable to
other systems. Thus, the proposed checkpoint mechanism can be easily ported to any
other Parallel File System.



4. Performance Results 

In this section we present some performance results that were obtained in a heterogeneous
distributed system composed of four workstations: a Sun Sparc 2 running Sun
OS 4.1.2 with 64 Mb RAM and 1.05 Gb of hard disk storage (HD), a Sun Sparc 10
running Sun OS 4.1.3 with 64 Mb RAM and 1.05 Gb of HD, and two DEC Alpha
21164 running Digital Unix 4.0b with 128 Mb RAM and 2.1 Gb HD. Workstations
were connected via a 10 Mbit/s Ethernet network.

We used as benchmark a PIOUS (pfbench) distributed application that splits each file
in segments and stores them in the four different machines. It also periodically performs
read and write operations in segments. The benchmark application was executed
using PIOUS 1.2.2 and PVM 3.3.3. All machines were used as I/O servers and the
Sun Sparc 10 was also the coordinator of the PIOUS file system.

We measured the overhead introduced in read, write and checkpoint operations by 
each checkpoint scheme (Shadow, Log Save or Log Write) included in the proposed 
distributed checkpointing mechanism. Due to the lack of space we only present the
most important figures and we refer the interested reader to [14].
Figure 2 represents the overhead when changing the number of write operations per- 
formed during the execution of the benchmark: 12, 24, 120 and 240 in each segment. 
The size of file used was 5760 kb (1440 kb in each I/O server) and each written block 
has the size of 6 kb. Pwrite represents the write operation in original PIOUS version. 
As shown in figure 2a, the performance of Log Save mechanism is the most penalized 
with the increase of the number of write operations done by an application. This is so 
because that scheme introduces a considerable overhead during each write operation. 
In Log Save every write operation involves two additional disk accesses, as, before 
writing the block to the file, the previous contents are read and synchronously written 
in the log file. 
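
The cost of Log Save on the write path can be seen in a small in-memory model that counts disk accesses. It is an illustrative sketch, not the PIOUS code: the class `LogSaveFile` is a hypothetical name, each list element stands for one disk block, and every write is logged, as stated above, even if the same block was already saved since the last checkpoint.

```python
class LogSaveFile:
    """In-memory model of the Log Save scheme (a log of previous contents)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # the file, one entry per block
        self.undo_log = []           # (block index, contents before the write)
        self.disk_accesses = 0

    def write_block(self, index: int, data: bytes) -> None:
        old = self.blocks[index]             # extra access 1: read old contents
        self.disk_accesses += 1
        self.undo_log.append((index, old))   # extra access 2: write them to the log
        self.disk_accesses += 1
        self.blocks[index] = data            # the write requested by the application
        self.disk_accesses += 1

    def checkpoint(self) -> None:
        self.undo_log.clear()                # the current contents become the checkpoint

    def rollback(self) -> None:
        # Undo in reverse order so the oldest saved contents win.
        for index, old in reversed(self.undo_log):
            self.blocks[index] = old
        self.undo_log.clear()
```

Each call to `write_block` performs three accesses instead of one, while `checkpoint` only has to discard the log, which matches the opposite behaviour of the curves for write and checkpoint overhead.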





Figure 2: a) Overhead in write operations; b) Overhead per checkpoint








Vitor N. Tavora, Luis M. Silva, and Joao Gabriel Silva 



Figure 2b represents the consequence of the number of write operations performed in 
the overhead at the checkpoint time. It shows that Log Write and Shadow schemes 
take more time executing each checkpoint operation because Log Write scheme needs 
to update the original file and Shadow scheme should duplicate the file. However the 
overhead introduced by these mechanisms in checkpoint operations can be negligible 
for the application as it mainly occurs during the phase 2 of file checkpoint. This 
phase is executed concurrently with the application running. We can also conclude 
that the first phase (Ckpt_Phasel) of the file checkpoint operation, during which the 
application must be suspended, introduces an insignificant overhead. 

Figure 3 shows the average overhead of a checkpoint operation when changing the 
size of the file (2880 kb, 5760 kb, 11520 kb and 23040 kb), which was split in four 
segments with the same size (one for each I/O server). The application performed 60 
write operations in each segment with 12 kb blocks.

As expected, when file size increases the Shadow scheme needs substantially more 
time to conclude the checkpoint operation. This is so because it must replicate the file 
during that operation. The application is just suspended in phase 1 of the checkpoint, 
however during the phase 2 some file accesses are suspended while it is replicated. 
The Shadow scheme can also introduce a significant space overhead due to the replication
done at each checkpoint and at the first open access to the file. The other
schemes (Log Save and Log Write) just store in a log file the portion of data updated
since the previous checkpoint. The file contents updated since the previous checkpoint are never
bigger than the file size and are usually much smaller than the entire file.

So, we can conclude that the Shadow scheme can be very inefficient for applications 
that make use of files with large-size segments.

We have verified that the three schemes introduced a negligible overhead on the
read operation. Their performances were similar for that operation.

We can also conclude that the Log Write is the most effective, especially when the 
application executes a considerable number of I/O operations (with some writes) and 
makes use of files with large-size segments.




Figure 3: Overhead per checkpoint (file size)

5. Comparison with Related Work

In this section, we give an overview of related work published in the literature and try 
to compare with the file checkpointing schemes herein presented. 

Some authors [12] assume that scientific applications only perform I/O before the
computation begins and after it finishes. Since this is not always the case,
such a scheme does not provide adequate I/O checkpointing.

Another strategy was followed in [13]: all the files that are open with write or 
read-write permissions are replicated on several disks. That system provides support 
for file availability but does not assure file consistency in case of rollback. 

The IBM Vesta Parallel I/O file system [16] provides a checkpointing facility for 
application files. There is a checkpoint function that creates a snapshot of the file's 
contents. This scheme has some similarities with our shadow mechanism, since the 
entire file is checkpointed in a synchronous way. 

A file checkpointing scheme for UNIX applications has been proposed in [9]. It is 
based on a lazy-backup approach: it saves the size of the file and the file pointer in 
each checkpoint operation, and it makes a shadow copy of the file only when it is 
about to be modified. Despite the "lazy" nature of this scheme, it is quite similar to our
shadow version. No implementation details or performance results were presented. 
The previous file-rollback functionality [9] was optimized later and taken out as a 
separate file checkpoint library libfcp [17]. It is similar to our log save scheme: it uses 
an in-place update with undo logs approach to checkpoint files. It intercepts all file 
operations except for read-only ones. When a file is opened for modifications, its size 
is recorded and an undo log of file truncation is generated. When the portion of the 
file that existed at previous checkpoint time is about to be modified, an undo log of 
restoring the pre-modification data is generated. When a rollback occurs, these undo 
logs are applied in a reversed order to restore the original files. 

A file checkpoint scheme similar to our Log Write mechanism was proposed in [18]:
it buffers all modification operations done in user files after the last checkpoint.
At the time of the next checkpoint, the buffered operations are flushed from the 
buffer to the corresponding user files and then the buffer is cleared. 



6. Conclusions and Future Work 

Checkpointing is extremely important for large and time-consuming applications, as 
restarting programs from the beginning can be very costly. A solution that supports 
only checkpointing of the application data but neglects the use of files might be use- 
less in many situations. Supporting file checkpoint is quite important because most of 
the real applications involve file I/O. 

This paper presents a distributed checkpointing mechanism for a Parallel File System 
that can be integrated with any application data checkpointing algorithm. The mechanism
was integrated in the PIOUS file system. We introduced and tested three different
file checkpointing schemes (Log Save, Log Write and Shadow) in that mechanism.
We concluded that the Log Write scheme can be the most effective, especially when
the application executes a considerable number of I/O operations (with some writes)
and makes use of files with large-size segments. This is due to the overhead introduced
by the Log Save scheme in each write operation and the inefficiency of the Shadow
scheme when applications make use of files with large-size segments.

As future work, we plan to improve the performance of the Log Write scheme, namely
by storing part of the log mechanism in main memory, and by integrating the presented
distributed checkpointing mechanism in other parallel file systems.

Abstract. This paper describes the implementation of a thread communication
library on top of MPI. It allows light-weight threads to communicate
with each other both locally between threads within the same
process as well as globally between threads on different processors. The
interface is similar to MPI with the use of thread identifiers instead of
processor ranks.

Problems occur in the implementation of global communication operations.
Due to limited tag space we are not able to specify source and
target thread identifiers in a call to MPI_Recv. As a result we may receive
messages from the wrong thread, which has to be resolved explicitly.



1 Introduction 

MPI, the message passing interface, defines as a standard the communication 
between different processors in a parallel machine or different computers in a 
network. While in MPI-1 the processors are statically allocated at startup 
time, MPI-2 |S| allows, similar to PVM, the dynamic creation of processes.

Different processes on one processor are used in numerous cases. One reason 
is to load-balance tasks of different, statically unknown execution time, where 
multiple tasks are allocated to one processor in order to balance the different task 
sizes PJ. Another case appears where the work is split into different subobjects, 
but their number does not exactly match the available processors 0 E] .

However, while creating independent processes allows much flexibility in the 
program, it is not appropriate in all cases. This is due to the fact that it re- 
quires to load another program into the memory, allocate a new process for it 
and regularly switch between the different processes which is quite expensive. 
Independent processes also cannot communicate with each other via direct memory
access but have to use something like sockets.^ For that reason the use of
(light-weight) threads instead of processes is often desirable. Threads run within
one process allowing them direct memory access as well as thread switches with 
very small overhead. 

Our aim is to provide a library where different threads can communicate with
each other. This includes both the communication of threads within one process
as well as with threads located on another processor. The interface is similar to
MPI, such that the thread communication forms a layer on top of MPI.

^ At least a switch to the kernel mode is necessary to access memory of another process.




[Figure: processors P0 and P1]



The functions are called TMPI_xxx, for thread communication over MPI. Apart
from the names, the only differences to MPI within the interface are that the
source or target is a thread identifier instead of the processor rank, and that
the MPI communicator is replaced by a thread communicator. This thread
communicator internally contains

— the MPI communicator with the group of processors, 

— the information where a certain thread is located, i.e. a mapping from thread 
identifiers to processor ranks, 

— as well as some queues of open send and receive operations and MPI requests 
which will be explained later. 

Therefore the threads can communicate in an MPI style, where the difference
between local communication between threads within one process and global
communication between threads on different processors becomes invisible to
the user.
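The three components of the thread communicator listed above could be grouped as in the following minimal Python sketch. It is an illustration only and not the actual implementation; the names ThreadComm, thread_to_rank, open_sends, open_recvs, pending_mpi_requests, rank_of and is_local are assumptions made for this example, and the MPI communicator is represented by an opaque placeholder.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List


@dataclass
class ThreadComm:
    """Hypothetical thread communicator; all field names are illustrative."""
    mpi_comm: Any                    # underlying MPI communicator (opaque here)
    my_rank: int                     # rank of the local MPI process
    thread_to_rank: Dict[int, int]   # thread identifier -> processor rank
    open_sends: Deque[Any] = field(default_factory=deque)   # unmatched sends
    open_recvs: Deque[Any] = field(default_factory=deque)   # unmatched receives
    pending_mpi_requests: List[Any] = field(default_factory=list)

    def rank_of(self, thread_id: int) -> int:
        """Processor rank on which the given thread is located."""
        return self.thread_to_rank[thread_id]

    def is_local(self, thread_id: int) -> bool:
        """True if the thread lives in the same MPI process as the caller."""
        return self.rank_of(thread_id) == self.my_rank


# Threads 0 and 1 share processor 0, thread 2 runs on processor 1.
comm = ThreadComm(mpi_comm=None, my_rank=0, thread_to_rank={0: 0, 1: 0, 2: 1})
print(comm.is_local(1), comm.is_local(2))
```

The is_local test is what would let a send or receive choose between the local queues and the MPI path without the caller noticing the difference.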

This paper describes the implementation of the thread communication library
on top of MPI. The actual implementation is based on OPAL-MPI [Z| and
communicating agents 0, but the description is independent of that. Local
communication is described in section 2, while global communication is
handled in section 3. Section 4 deals with the problems of synchronous and
collective operations. Finally, section 5 concludes.



2 Local Communication 

While in the case of global, inter-processor communication the matching send 
and receive operations can in principle be executed simultaneously (on different 
processors), this is not possible for local communication. Threads run concur- 
rently on one processor, so only one thread can be executed at a certain time. 

Therefore a send operation has to store the data such that a following, matching
receive operation can fetch it. Analogously, a receive operation stores a
receive request such that a following send can directly provide the data. To keep
the order of communication operations, i.e. the value that has been sent first has
to be received first, we manage queues of open send and open receive operations
within the thread communicator.

A local isend operation, i.e. an isend where both source and target thread
are allocated within the same MPI process,^2 therefore has to distinguish the
following two cases. If there exists an open receive request, then we remove the
first matching request from the open receive queue, put the data there and mark
the request as ready. Otherwise we just append a new send request at the end of
the open send queue. In order to avoid unnecessary data copying we just store
a pointer to the data within the request and not the data itself.

A local irecv operation works analogously. First it checks if there is already
data available in the open send queue. If this is the case, it takes the data
from the first matching send request; otherwise it stores a new receive request
in the open receive queue.
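The interplay of the two queues can be sketched as follows. This is a simplified Python illustration under stated assumptions: the names Request, open_sends, open_recvs, local_isend and local_irecv are hypothetical, matching is done on the exact (source, target, tag) triple without wildcards, and thread suspension in wait is not modeled.

```python
from collections import deque


class Request:
    """A pending send or receive; 'ready' is set once it has been matched."""
    def __init__(self, source, target, tag, data=None):
        self.source, self.target, self.tag = source, target, tag
        self.data = data
        self.ready = False


open_sends = deque()   # sends that found no matching receive yet
open_recvs = deque()   # receives that found no matching send yet


def _take_first_match(queue, source, target, tag):
    """Remove and return the oldest request matching the triple, or None."""
    for req in queue:
        if (req.source, req.target, req.tag) == (source, target, tag):
            queue.remove(req)
            return req
    return None


def local_isend(source, target, tag, data):
    recv = _take_first_match(open_recvs, source, target, tag)
    send = Request(source, target, tag, data)
    if recv is not None:          # a receive is already waiting: hand over data
        recv.data = data          # stores a reference, not a copy
        recv.ready = send.ready = True
    else:
        open_sends.append(send)   # keep FIFO order of unmatched sends
    return send


def local_irecv(source, target, tag):
    send = _take_first_match(open_sends, source, target, tag)
    recv = Request(source, target, tag)
    if send is not None:          # data was sent earlier: fetch it immediately
        recv.data = send.data
        recv.ready = send.ready = True
    else:
        open_recvs.append(recv)   # wait for a later send
    return recv


s1 = local_isend(0, 1, 7, "first")
s2 = local_isend(0, 1, 7, "second")
r1 = local_irecv(0, 1, 7)
r2 = local_irecv(0, 1, 7)
r3 = local_irecv(0, 1, 7)         # nothing sent yet, request is queued
local_isend(0, 1, 7, "third")
print(r1.data, r2.data, r3.data)
```

Because both queues are scanned from the oldest entry, the value sent first is also received first, which is the ordering property required above.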

The more interesting part happens within the wait operation. If the request
is already marked as ready, i.e. an isend could directly store the data within a
matching irecv request or an irecv could immediately read the data from an
open isend, then the wait has nothing to do. Otherwise, if we wait for an open
isend or irecv operation, the current thread has to be suspended and another
active thread has to be chosen by the scheduler. The suspended thread can
only become active again if another thread performs a matching communication
operation, which will change the state of the request to ready and wake the
thread.

3 Global Communication 

3.1 Matching Thread Identifiers to MPI Parameters 

For the global communication we want to use the routines provided by MPI. 
Unfortunately, they cannot be used directly as we have to map the thread pa- 
rameters to the MPI parameters. In MPI, a communication is described by a 
communicator, the rank of source and target processors within the communi- 
cator as well as a tag. In our thread communication library we have to take 
care of the fact that on one processor there can be multiple concurrent threads. 
We therefore have to ensure that a thread receives only data which was
explicitly sent to it and not to any other thread on the same processor.

This means that we have to encode the identifiers of source and target thread
within the above mentioned MPI communication parameters. As the processor
number is already uniquely determined by the thread running on it, we could
only use the MPI communicator or the tag to encode the thread identifiers.
Using extra MPI communicators does not make sense due to the large number of
communicators needed and therefore the large memory overhead. As the
communicator internally contains, among other things, the list of processors as
well as their topology, assigning an MPI communicator to every thread identifier
would imply at least O(P^2) memory usage on each of the P processors.^3
Unfortunately, we also cannot use the tag to encode the thread numbers, because
that would remove the possibility to use different tags for the thread
communication. Besides that, it would also restrict the number of possible
threads.^4

^2 which could be a static processor in case of MPI-1 or a dynamic process in case of
MPI-2

^3 If MPI could guarantee that the internal data is shared between communicators
containing the same process group, we would be able to use a special communicator
for every source/target thread-id pair.

Therefore we can only map the thread identifiers to the corresponding process
ranks and pass the tag and communicator directly to MPI. In order to distinguish
the communication from or to multiple threads on one processor, we have to send
the identifiers of the local as well as the remote thread together with the data.
So a call

    TMPI_ISend(data, ..., target-id, tag, thread-comm)

will be mapped to

    MPI_ISend([self-id, target-id, data], ...,
              PE(target-id), tag, MPI-comm(thread-comm))
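The mapping above amounts to prefixing the payload with a small header and looking up the processor rank of the target thread. A minimal Python sketch, assuming 32-bit thread identifiers and a plain dictionary for the thread-to-rank mapping (the names HEADER, pack_message, unpack_message and map_isend are hypothetical and no real MPI call is made):

```python
import struct

HEADER = struct.Struct("!ii")   # two signed 32-bit ints: source id, target id


def pack_message(self_id: int, target_id: int, payload: bytes) -> bytes:
    """Build [self-id, target-id, data] as one contiguous buffer."""
    return HEADER.pack(self_id, target_id) + payload


def unpack_message(buffer: bytes):
    """Split a received buffer into source id, target id and payload."""
    source_id, target_id = HEADER.unpack_from(buffer)
    return source_id, target_id, buffer[HEADER.size:]


def map_isend(payload, self_id, target_id, tag, thread_to_rank):
    """Return the arguments that would be handed to the MPI send."""
    buffer = pack_message(self_id, target_id, payload)
    return buffer, thread_to_rank[target_id], tag


buf, rank, tag = map_isend(b"hello", 3, 5, 42, {3: 0, 5: 1})
print(rank, tag, unpack_message(buf))
```

The tag is passed through unchanged, so the full tag space stays available to the application; the price is the extra header in every message.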



3.2 Receiving Wrong Messages 

The problem with that approach is that we cannot fully specify the desired
communication operation in MPI. This is unfortunately unavoidable as long as MPI
does not provide a richer tagging scheme. Allowing arrays of tags instead of a
single integer tag would have been sufficient in our case. For the receive
operation we can therefore specify only the communicator, the tag and the remote
processor rank, but not the thread identifiers of the source and of the current
thread as target.

As a result, waiting for a receive request can deliver data which is either
meant for a different thread than the currently running one, or has been sent
from a different thread than the desired one. In the case where we got the wrong
message, we store it in the open send queue for a later receive and make a new
call to MPI_IRecv/MPI_Wait until we get the data with the correct source and
target thread identifiers. However, if there already exists an MPI request
belonging to a receive operation whose source and target thread, tag and
communicator match, then we directly put the data into the corresponding receive
request, mark it as ready and use its MPI request for our receive. This is
possible because the communicator, tag and processor rank specified within the
MPI request correspond to our receive operation.
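The basic resolution strategy, stashing wrong messages until the right one arrives, can be illustrated with the following Python sketch. It makes several simplifying assumptions: the sequence of MPI receive/wait calls for one rank and tag is modeled by a queue named wire, the open send queue is modeled by a queue named stash, and the shortcut of handing data directly to another pending receive request is left out. The function name receive_for is hypothetical.

```python
from collections import deque


def receive_for(source_id, target_id, wire, stash):
    """Return the payload sent from source_id to target_id.

    wire  : deque of (source_id, target_id, payload) in arrival order; stands
            in for repeated MPI receive/wait calls with the same rank and tag.
    stash : deque playing the role of the open send queue for early messages.
    """
    # A message received earlier on behalf of another thread may already fit.
    for msg in stash:
        if msg[:2] == (source_id, target_id):
            stash.remove(msg)
            return msg[2]
    # Otherwise keep receiving until the thread identifiers match.
    while wire:
        msg = wire.popleft()
        if msg[:2] == (source_id, target_id):
            return msg[2]
        stash.append(msg)          # wrong message: keep it for a later receive
    raise LookupError("no matching message has arrived")


wire = deque([(7, 2, "for T2"), (8, 1, "from T8"), (7, 1, "wanted")])
stash = deque()
first = receive_for(7, 1, wire, stash)    # skips two wrong messages
second = receive_for(7, 2, wire, stash)   # served from the stash
print(first, second, len(stash))
```

Messages that arrive too early are never lost; they are simply delivered later from the stash, in their original arrival order.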



3.3 Handling MPI_Wait 

A second problem occurs if the scheduler switches to another thread only at
certain points in the program. Most thread systems avoid the overhead of a
time-slicing scheduler as used for large-grained processes, since time slicing
may result in interrupts at arbitrary points, so that the whole program context
consisting of register values, stacks etc. has to be saved. Instead, light-weight
threads are scheduled only on demand. This leads to smaller overheads, but also
to the inability to interrupt calls to external libraries.

^4 On a 32-bit computer we could encode approximately 32,000 threads as we have to
encode both source and target identifiers, and some values like MPI_ANY_TAG cannot
be used. On a machine with 2,000 processors this results in only 8 threads per
processor.

For this reason we cannot directly execute MPI_Wait if a wait is called, as this
may result in deadlocks. Imagine we have two threads on the same processor where
one is sending data to and the other is receiving data from a third thread on
another processor:



            P0                                     P1

    T0                    T1                       T2

    irecv(T2, ..., Req0)  isend(T2, ..., Req1)     recv(T1)
    wait(Req0)            wait(Req1)               send(T0)



If now thread T0 is executed before T1, then the processor P0 is locked in the call
of MPI_Wait(Req0). As thread switches only occur between different commands
and not in a time-sliced manner, process P0 will stay within the waiting operation
and thread T1 will never be executed. That means that T2 will wait for T1 and
never send anything to T0, which is the only way to finish the library call to
MPI_Wait.

For that reason we have to delay the execution of MPI_Wait as long as we have
other active threads on the same processor. In that case the current thread which
calls a wait is suspended and the wait request is stored in a queue within the
thread communicator. If there are no active threads anymore, i.e. all threads are
suspended, we call MPI_Waitsome with the list of all open MPI requests. Those
threads whose MPI requests are finished will be activated again. Note that in
the case of finishing a wrong receive request, i.e. the source or target identifier
does not match, we either exchange the request with that of a matching one and
activate the other thread, or we put the data into the open send queue for a later
request, initiate another receive request and possibly call MPI_Waitsome again.
In order to avoid too many open MPI requests we call MPI_Testall with that
list at every call to wait.
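The delayed wait can be illustrated with a small cooperative scheduler. The following Python sketch is only a model under stated assumptions: threads are generators that yield a request object when they call wait, the function passed as waitsome is a stand-in for MPI_Waitsome, the remote thread T2 is simulated inside fake_waitsome, and the additional MPI_Testall polling is not modeled. All names (Req, run, fake_waitsome) are hypothetical.

```python
from collections import deque


class Req:
    def __init__(self, name):
        self.name, self.done = name, False


def run(threads, waitsome):
    """Cooperative scheduler: a thread yields a Req when it calls wait."""
    active = deque(threads)
    suspended = {}                       # Req -> suspended thread
    trace = []                           # order in which requests completed
    while active or suspended:
        if active:
            thread = active.popleft()
            try:
                req = next(thread)       # run until the thread waits
            except StopIteration:
                continue                 # thread finished
            if req.done:
                active.append(thread)    # nothing to wait for
            else:
                suspended[req] = thread  # delay the blocking wait
        else:
            # Only now, with every thread suspended, is blocking safe.
            for req in waitsome(list(suspended)):
                trace.append(req.name)
                active.append(suspended.pop(req))
    return trace


req0, req1 = Req("Req0"), Req("Req1")
posted = []


def t0():
    posted.append("irecv")               # irecv(T2, ..., Req0)
    yield req0                           # wait(Req0)


def t1():
    posted.append("isend")               # isend(T2, ..., Req1)
    yield req1                           # wait(Req1)


def fake_waitsome(reqs):
    # Simulated remote thread T2: it receives from T1 first and sends to T0
    # afterwards, so Req1 completes before Req0.
    assert "isend" in posted             # the send was posted before blocking
    for req in (req1, req0):
        if req in reqs and not req.done:
            req.done = True
            return [req]
    return []


trace = run([t0(), t1()], fake_waitsome)
print(trace)
```

Although T0 is scheduled first, its wait no longer blocks the processor: T1 still gets to post its send, and only then does the scheduler block in the waitsome stand-in.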



4 Other Communication Operations 

4.1 Synchronous Communication 

A synchronous send operation has to wait until the matching receive has been 
called. For the local, synchronous communication we therefore have to suspend 
the current thread if we insert a send request into the open send queue and re- 
activate the thread if a receive operation removes the request from the queue. If, 
on the other hand, there is already a matching request in the open receive queue, 
we can immediately continue the sending thread as the receive has already been 
initiated. 

For the global, synchronous communication it is not sufficient to use a
synchronous MPI operation, as the transferred message may be received by a wrong
thread with a non-matching thread identifier. In order to inform the sender that
the data has been received by the correct thread we have to send back an
explicit acknowledgment message, so a synchronous send operation corresponds
to sending the data and receiving the acknowledgment. This acknowledgment is
sent in a different communicator such that it cannot interfere with normal data
messages, and uses a message counter as tag to enable waiting for different
messages at the same time. As the receiver has to know whether it has to send an
acknowledgment or not, every message contains not only the data together
with source and target thread identifier, but also the message counter. The
counter has the specific value NO_ACK = -1 in case of an asynchronous operation.
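A possible shape of the extended message header and of the acknowledgment decision is sketched below in Python. Only the value NO_ACK = -1 is taken from the text; the header layout, the per-process counter and the names build_message, on_matched_receive and send_ack are assumptions for this example, and send_ack is a placeholder for the send in the separate acknowledgment communicator.

```python
import itertools
import struct

NO_ACK = -1                               # value given in the text
HEADER = struct.Struct("!iii")            # source id, target id, message counter
_counter = itertools.count()              # hypothetical per-process counter


def build_message(source_id, target_id, payload, synchronous):
    """Prefix the payload with thread ids and the message counter."""
    counter = next(_counter) if synchronous else NO_ACK
    return HEADER.pack(source_id, target_id, counter) + payload, counter


def on_matched_receive(message, send_ack):
    """Called once the message has reached the thread it was meant for."""
    source_id, target_id, counter = HEADER.unpack_from(message)
    if counter != NO_ACK:
        # The acknowledgment uses the counter as its tag, so the sender can
        # wait for several outstanding synchronous sends at the same time.
        send_ack(to_thread=source_id, tag=counter)
    return message[HEADER.size:]


acks = []


def record_ack(to_thread, tag):
    acks.append((to_thread, tag))


sync_msg, sync_counter = build_message(1, 2, b"sync", synchronous=True)
async_msg, async_counter = build_message(1, 2, b"async", synchronous=False)
data1 = on_matched_receive(sync_msg, record_ack)
data2 = on_matched_receive(async_msg, record_ack)
print(acks, async_counter)
```

Only the synchronous message triggers an acknowledgment, and it is sent when the correct thread has taken the data, not when the MPI-level receive completes.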

In the current version all global messages between different threads as well 
as the acknowledgment messages are sent as separate MPI messages. This might 
be optimized if a message aggregation technique as described in 0 were used, 
such that messages between different threads but the same processors were sent 
together as one big message. 



4.2 Collective Operations 

Due to the presence of local thread communication we cannot use the MPI 
routines for collective operations but have to re-implement them ourselves. 

A direct use of MPI is possible if there is exactly one thread on all processors. 
In that special case not only the collective operations but also all the other 
functions are mapped one-to-one to MPI, as dealing with threads is not necessary 
then. 

5 Conclusion 

We have described the implementation of a thread communication library on 
top of MPI. It allows light-weight threads to communicate with each other both 
locally between threads within the same process as well as globally between 
threads on different processors. The interface is similar to MPI with thread 
identifiers instead of processor ranks. 

Local communication is realized via open send and receive queues, while
global point-to-point communication is mapped to MPI. As MPI only uses the
communicator, the processor rank and an integer tag to distinguish different
communication requests, we are unable to specify the thread identifiers of source
and target that way. Therefore we have to send them together with the actual
data. As a result a thread can receive wrong messages which should have been
received by a different thread, as MPI_IRecv cannot differentiate between them.
This has to be resolved explicitly by putting the data which was received too
early into the open send queue and calling MPI_IRecv again until the correct data
is received. It also makes it necessary to explicitly send an acknowledgment
message in case of synchronous communication, as the synchronization provided by
MPI is not sufficient here. If MPI provided a more flexible tagging mechanism,
e.g. an array of tags, this overhead could be avoided and left to MPI.

Performance results demonstrate interesting improvements compared to MPI.
A simple ping-pong test executed 20,000 times with 2 threads on one processor
takes 12.4 seconds for the thread communication and another second for
program start and thread initialization. MPICH needs almost 2 seconds for the
program initialization, and the communication phase takes 169.4 seconds. For
global communication we have encountered a small overhead, but this is
outweighed by the larger process creation time if we generate threads quite often
in the program as in |^ .



Abstract. HARNESS is an experimental Java-centric metacomputing system
based upon the principle of dynamic reconfigurability not only in terms of the
computers and networks that comprise the virtual machine, but also in the
capabilities of the VM itself. In HARNESS, as in any other metacomputing
system, providing consistent naming is a fundamental issue and the naming
service is a pillar for any other service provided. HARNESS provides a two
level naming scheme that separates virtual machine names from service names.
In this paper we describe a simple yet fault tolerant implementation of the
naming service dedicated to virtual machine names.



1 Introduction 

HARNESS [1] is a metacomputing framework that is based upon several experimental 
concepts, including dynamic reconfigurability and fluid, extensible, virtual machines. 
HARNESS is a joint project between Emory University, Oak Ridge National Lab, and 
the University of Tennessee, and is a follow-on to PVM [2], a popular network-based
distributed computing platform of the 1990s. The underlying motivation behind
HARNESS is to develop a metacomputing platform for the next generation, 
incorporating the inherent capability to integrate new technologies as they evolve. The 
first motivation is an outcome of the perceived need in metacomputing systems to 
provide more functionality, flexibility, and performance, while the second is based 
upon a desire to allow the framework to respond rapidly to advances in hardware, 
networks, system software, and applications. Both motivations are, in some part, 
derived from our experiences with the PVM system, whose monolithic design implies 
that substantial re-engineering is required to extend its capabilities or to adapt it to 
new network or machine architectures. 

HARNESS attempts to overcome the limited flexibility of traditional software
systems by defining a simple but powerful architectural model based on the concept
of a software backplane. The HARNESS model is one that consists primarily of a
kernel that is configured, according to user or application requirements, by attaching
“plug-in” modules that provide various services. Some plug-ins are provided as part
of the HARNESS system, others might be developed by individual users for
special situations, and yet others might be obtained from third-party
repositories. By configuring a HARNESS virtual machine using a suite of plug-ins
appropriate to the particular hardware platform being used, the application being
executed, and resource and time constraints, users are able to obtain functionality and
performance that is well suited to their specific circumstances. Furthermore, since the
HARNESS architecture is modular, plug-ins may be developed incrementally for
emerging technologies such as faster networks or switches, new data compression
algorithms or visualization methods, or resource allocation schemes, and these may
be incorporated into the HARNESS system without requiring a major re-engineering
effort.

The generality and the level of dynamicity achieved by the HARNESS framework
impose very stringent requirements on the naming service. In order to be able
to manage geographically distributed resources while tracking the evolution of a
virtual machine, HARNESS needs a global name space that is updated in a timely
and consistent manner. To limit the complexity of the naming services, HARNESS
separates the problem of dealing with a global set of computational resources from
that of tracking the changes of service sets by generating a two-level name space.
In the first
level HARNESS keeps track of all the virtual machines currently active and 
guarantees the uniqueness of virtual machine names. This level allows nodes willing 
to join a virtual machine to find out if the virtual machine exists and if this is the case 
to locate it. Each virtual machine name contained in this space is a key to access a 
second level name space where uniqueness of service names and timely updating of 
the set of services available is guaranteed. This level allows applications and services 
to locate other services inside a virtual machine. 
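The two-level structure can be pictured as a table of virtual machine names, each of which is the key to its own table of service names. The following Python sketch is purely illustrative: in HARNESS the two levels are maintained by separate mechanisms, whereas this example collapses them into a single hypothetical class named NameSpace in order to show the uniqueness guarantees at each level.

```python
class NameSpace:
    """Two-level name space: VM names map to per-VM service tables."""

    def __init__(self):
        self._vms = {}                      # VM name -> {service name: provider}

    def create_vm(self, vm_name):
        """First level: virtual machine names must be unique."""
        if vm_name in self._vms:
            raise KeyError(f"virtual machine {vm_name!r} already exists")
        self._vms[vm_name] = {}

    def locate_vm(self, vm_name):
        """First level: lets a node find out whether a VM exists."""
        return vm_name in self._vms

    def register_service(self, vm_name, service_name, provider):
        """Second level: service names must be unique within one VM."""
        services = self._vms[vm_name]
        if service_name in services:
            raise KeyError(f"service {service_name!r} already registered")
        services[service_name] = provider

    def locate_service(self, vm_name, service_name):
        """Second level: find a service inside a given VM, or None."""
        return self._vms[vm_name].get(service_name)


ns = NameSpace()
ns.create_vm("vm-a")
ns.create_vm("vm-b")
ns.register_service("vm-a", "message-passing", "node1")
ns.register_service("vm-b", "message-passing", "node7")   # same name, other VM
print(ns.locate_vm("vm-a"), ns.locate_service("vm-b", "message-passing"))
```

The same service name may appear in different virtual machines, since uniqueness of service names is only required within one second-level name space.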

In this paper we will focus on the problem of generating and maintaining the first 
level name space, i.e. the space containing the names of HARNESS virtual machines. 

This paper is structured as follows: in section 2 we begin with an overview of the 
HARNESS model and implementation; in section 3 we briefly outline the current 
implementation of HARNESS naming service and we describe our implementation of 
a fault-tolerant name space for HARNESS; finally, in section 4 we provide some 
concluding remarks. 



2 Architectural Overview of HARNESS 

The fundamental abstraction in the HARNESS metacomputing framework is the
Distributed Virtual Machine (DVM) (see figure 1, level 1). Any DVM is associated
with a symbolic name that is unique in the HARNESS name space, but has no
physical entities connected to it. Heterogeneous computational resources may
enroll into a DVM (see figure 1, level 2) at any time; however, at this level the DVM
is not yet ready to accept requests from users. To get ready to interact with users and
applications, the heterogeneous computational resources enrolled in a DVM need to
load plug-ins (see figure 1, level 3). A plug-in is a software component implementing
a specific service. By loading plug-ins a DVM can build a consistent service baseline
(see figure 1, level 4). A service provided by a loaded plug-in is associated with a
name that is unique in the DVM name space. Users may reconfigure the DVM at any
time (see figure 1, level 4) both in terms of computational resources enrolled, by
having them join or leave the DVM, and in terms of services available, by loading and
unloading plug-ins.

Figure 1 Abstract model of a HARNESS Distributed Virtual Machine

The main goal of the HARNESS metacomputing framework is to achieve the
capability to enroll heterogeneous computational resources into a DVM and make
them capable of delivering a consistent service baseline to users. This goal requires
the programs building up the framework to be as portable as possible over as large
a selection of systems as possible. The availability of services to heterogeneous
computational resources derives from two different properties of the framework: the 
portability of plug-ins and the presence of multiple searchable plug-in repositories. 
HARNESS implements these properties mainly leveraging two different features of 
Java technology. These features are the capability to layer a homogeneous 
architecture such as the Java Virtual Machine (JVM) [3] over a large set of 
heterogeneous computational resources, and the capability to customize the 
mechanism adopted to load and link new objects and libraries. 

The adoption of the Java language has also given us the capability to tune the
trade-off between portability and efficiency for the different components of the
framework. This capability is extremely important: although portability at large is
needed in all the components of the framework, it is possible to distinguish three
different categories of components that require different levels of portability. The
first category is represented by the components implementing the capability to manage
the DVM status and to load and unload services. We call these components kernel level
services. These services require the highest achievable degree of portability, as
they are necessary to enroll a computational resource into a DVM. The second
category is represented by very commonly used services (e.g. a general, network
independent, message-passing service or a generic event notification mechanism). We
call these services basic services. Basic services should be generally available, but it
is conceivable for some computational resources based on specialized architectures to
lack them. The last category is represented by highly architecture specific services.
These include all those services that are inherently dependent on the specific
characteristics of a computational resource (e.g. a low-level image processing service
exploiting a SIMD co-processor, a message-passing service exploiting a specific
network interface or any service that needs architecture dependent optimization). We
call these services specialized services. For this last category portability is a goal to
strive for, but it is acceptable that they will be available only on small subsets of the
available computational resources. These different requirements for portability and
efficiency can optimally leverage the capability to link together Java byte code and
system dependent native code enabled by the Java Native Interface (JNI) [4]. The JNI
allows developing the parts of the framework that are most critical to efficient
application execution in ANSI C and introducing into them the desired
level of architecture dependent optimization at the cost of increased development
effort.

The use of native code requires a different implementation of a service for each 
type of heterogeneous computational resource enrolled in the DVM. This fact implies 
a larger development effort. However, if a version of the plug-in for a specific 
architecture is available, the HARNESS metacomputing framework is able to fetch 
and load it in a user transparent fashion, thus users are screened from the necessity to 
control the set of architectures their application is currently running on. To achieve 
this result HARNESS leverages the capability of the JVM to let users redefine the 
mechanism used to retrieve and load both Java classes bytecode and native shared 
libraries. In fact, each DVM in the framework is able to search a set of plug-in
repositories for the desired library. This set of repositories is dynamically
reconfigurable at run-time: users can add or delete repositories at any time.



3 Naming Services in the HARNESS System 

At present the HARNESS metacomputing framework provides two implementations 
for the virtual machines name space. 

The first implementation is based on IP multicast and adopts a peer-to-peer discovery
protocol similar to the one used in the Jini system [5]. This implementation is
extremely resilient to failures; in fact, no component represents a single point of
failure. However, the strengths of this implementation are also its limits. While
IP multicast works extremely well over a single LAN, it is extremely unreliable and
suffers from severe scalability problems over WANs. For these reasons this
implementation of the naming service fits only the scenario where the computational
resources of a single entity (e.g. an enterprise, a university, etc.) can be dynamically
enrolled into HARNESS virtual machines.

The second implementation is based on a naming server residing on a well known
host and accepting connections on a well known port. This implementation overcomes
one of the limitations imposed by the first implementation: it can easily
provide a name space for computational resources distributed over the whole Internet.
However, the price of this result is the introduction of a single point of failure in the
system architecture. If the name server is not available, it is possible neither to
enroll additional nodes into existing virtual machines nor to create new virtual
machines*.

To overcome the limitations of the two available implementations we have developed 
a distributed naming service. Our design is based on some fundamental assumptions: 

1. the routing of IP is designed in such a way that the property of a node to be 
reachable from another node is symmetric and transitive; 

2. any situation that negates assumption 1 is transient; 

3. it is acceptable for the name space to experience short, temporary splitting as 
long as the steady state guarantees uniqueness and consistency. 

Assumption one means two things. First, the fact that A is reachable from B also 
implies B is reachable from A; second, the fact that A is reachable from B and B is 
reachable from C also means that A is reachable from C. To our knowledge, the only
non-transient situations in which this property is negated are generated
by the use of firewalls. However, this does not represent a major limitation of our
design: our implementation is based on the TCP protocol and statically
configured ports, thus it is possible to configure the firewall so that the packets
directed to our naming service are not filtered out. Assumption three allowed us to
avoid the large overhead required by a fully distributed, atomically updated naming
space.

In our implementation each HARNESS kernel is configured with a list of node/port
pairs that identify the HARNESS Name space Servers (HNS) that the kernel can
query. The list is ordered according to the IP addresses of the HNS so that a lower IP
address will always be contacted before a higher one. A kernel that needs to contact
an HNS will try each of the listed servers in turn. If an HNS is not running, the kernel
will simply time out and try the next one. If an HNS is running, the kernel can get
two different replies: an acknowledgement that the contacted HNS is the current list
master, or a redirect reply.

If the kernel gets a redirect it will keep on going through the list. Once the kernel has
reached the end of the list it will start again from the head, up to five times. After five
full list scans it will give up and abort.

* It is important to notice, however, that existing virtual machines maintain all the remaining
functionality, i.e. on the currently enrolled nodes services can still be used, added and
removed freely.

A Simple, Fault Tolerant Naming Space for the HARNESS Metacomputing System

Figure 2 Finite state automaton describing the behavior of each of the HNS
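The lookup procedure followed by a kernel can be summarized in a few lines of Python. This is a sketch under stated assumptions rather than the HARNESS code: the function find_list_master and the query callback are hypothetical, the callback stands in for the real TCP exchange and is assumed to return "master" or "redirect" or to raise TimeoutError when the HNS is not running, and the port number in the example is made up.

```python
import ipaddress

MAX_SCANS = 5                                # "up to five times" in the text


def find_list_master(servers, query):
    """Scan the HNS list in ascending IP order until one acknowledges.

    servers : iterable of (ip, port) pairs
    query   : callable(server) returning "master" or "redirect", or raising
              TimeoutError when the HNS is not running
    """
    ordered = sorted(servers, key=lambda s: ipaddress.ip_address(s[0]))
    for _ in range(MAX_SCANS):
        for server in ordered:
            try:
                reply = query(server)
            except TimeoutError:
                continue                     # HNS not running: try the next one
            if reply == "master":
                return server
            # on a redirect, simply continue with the next entry
    raise RuntimeError("no list master found after %d scans" % MAX_SCANS)


def fake_query(server):
    if server[0] == "10.0.0.1":
        raise TimeoutError                   # lowest IP is down
    return "master" if server[0] == "10.0.0.2" else "redirect"


servers = [("10.0.0.10", 9000), ("10.0.0.2", 9000), ("10.0.0.1", 9000)]
result = find_list_master(servers, fake_query)
print(result)
```

Sorting by parsed address rather than by string matters here: as strings, "10.0.0.10" would sort before "10.0.0.2" and the kernel would not contact the servers in ascending IP order.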

It is important to notice that the decision about the faulty state of an HNS is not
permanent. On the contrary, the status of a faulty HNS will be probed again as soon as
one of the following two events takes place: a non-master HNS is contacted, or there
has been no probe for a configurable amount of time. This guarantees that a transient
failure cannot be wrongly assumed to be a permanent failure.

At any given time only one HNS is in charge of the name space; every other HNS
will refuse to service kernels and will redirect them. Figure 2 shows the finite
state automaton that describes the behavior of each HNS. Each HNS bootstraps
in the slave state. The reception of a query from a kernel will trigger the process of
checking who is the current list-master. This check consists of pinging the ordered
HNS list to see if there is a running HNS with an IP address lower than that of the
HNS currently performing the check. If there is one, it will remain the list-master and
the one currently checking will stay in the slave state. If there is none, the one currently
checking will make the transition to the master state.
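The slave-to-master transition described above can be modeled as follows. This Python sketch is an assumption-laden illustration: the class HNS, its methods and the ping callback are hypothetical, the callback stands in for the TCP probe of another server, and transitions other than the one from slave to master are not modeled because the text does not detail them.

```python
import ipaddress


class HNS:
    def __init__(self, ip, peers, ping):
        self.ip = ipaddress.ip_address(ip)
        self.peers = sorted(ipaddress.ip_address(p) for p in peers)
        self.ping = ping            # callable(ip) -> bool, placeholder probe
        self.state = "slave"        # every HNS bootstraps in the slave state

    def check_list_master(self):
        """Become master only if no HNS with a lower IP is running."""
        lower_running = any(self.ping(p) for p in self.peers if p < self.ip)
        self.state = "slave" if lower_running else "master"
        return self.state

    def handle_query(self, vm_name):
        """Serve a kernel query, running the check only while in slave state."""
        if self.state == "slave":
            self.check_list_master()
        return "ack" if self.state == "master" else "redirect"


hns_list = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.5"]
alive = {"10.0.0.3", "10.0.0.5"}             # the two lowest servers are down


def ping(ip):
    return str(ip) in alive


third = HNS("10.0.0.3", hns_list, ping)
fifth = HNS("10.0.0.5", hns_list, ping)
print(third.handle_query("vm-a"), fifth.handle_query("vm-a"))
```

The running server with the lowest IP address acknowledges the query, while the other one redirects the kernel, which matches the ordered scan performed on the kernel side.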

The check for the current list-master may require a long time. The
worst case is represented by the situation where, with a list of N HNS, the N-1 HNS
with the lower IPs are not running. In this situation the Nth HNS needs to time out on
a TCP connect N-1 times in order to get list-master status. However, this check is
performed only when an HNS in slave state receives a query from a kernel, thus the
overhead required by this process is not injected into each name space query. On the
contrary, it is incurred only at the time of the first query and in the case of
list-master faults.

Our scheme does not copy all the HARNESS virtual machine name information to
the HNS in the slave state. However, in the case of a list-master fault the refresh
mechanism built into the HARNESS kernel will automatically reconstruct the
complete set in the newly elected list-master. This process will take no more than
one refresh timeout. During this time frame the HARNESS virtual machine name
space is in an inconsistent state: it is possible for a kernel to start a second copy
of an existing virtual machine, as its existence has not been copied into the new
list-master yet. However, this inconsistent, split-brain-like state is strictly transient and
will be automatically removed as soon as the name space is refreshed. In fact, the
virtual machine trying to refresh the faulting HNS will detect the fault, contact the 
new list-master and receive a notification that there is another set of nodes running as 
the same virtual machine. The nodes enrolled in the original virtual machine will 
automatically join the new virtual machine and the system will be in a consistent state 
again. It is important to notice that this sequence of events will be perceived by the 
services and applications running in the virtual machine only as a growth of the set of 
nodes enrolled in the virtual machine, thus this process will not prevent the software 
components from performing the currently ongoing activities. 

It is our opinion that in most cases the advantages of a fast, low overhead service in 
the steady state largely compensate for the problems that can be caused by these 
transient inconsistencies. 
4 Conclusions 

In this paper we have described a simple, yet fault-tolerant implementation of the 
HARNESS naming service dedicated to the tracking and management of virtual 
machine names. Our implementation overcomes the limitations of the naming space 
implementations currently available for HARNESS, namely either the presence of a 
single point of failure or limited usability in WAN connected scenarios, without 
introducing the large overhead needed by a fully distributed, atomically updated 
naming service. In fact, when operating in steady state, our implementation has an 
operational overhead as low as the one introduced by a centralized solution. This 
result is achieved by relaxing the atomicity constraint and allowing transient 
inconsistency in the namespace to happen. However, return to a consistent steady 
state in a finite amount of time is guaranteed as long as there are no asymmetric 
connectivity patterns such as the ones generated by firewalls configured to filter out 
HARNESS naming queries. The length of the time period required to return to 
consistent state is controlled by a configurable time-out. 
Abstract. The MPI standard provides a way to send and receive complex 
combinations of datatypes (e.g., integers and doubles) with a single 
communication operation. The MPI standard specifies that the type sig- 
nature, that is, the basic datatypes (language-defined types such as int 
or DOUBLE PRECISION), must match in communication operations such 
as send/receive or broadcast. Because datatypes may be defined by the 
user in MPI, there is a limitless collection of possible type signatures. 
Detecting the programmer error of mismatched datatypes is difficult in 
this case; detecting all errors essentially requires sending a complete de- 
scription of the type signature with a message. This paper discusses an 
alternative: send the value of a function of the type signature so that (a) 
identical type signatures always give the same function value, (b) dif- 
ferent type signatures often give different values, and (c) common cases 
(e.g., predefined datatypes) are handled exactly. Thus, erroneous pro- 
grams are often (but not always) detected; correct programs never are 
flagged as erroneous. The method described is relatively inexpensive to 
compute and uses a small (and fixed, independent of the complexity of 
the datatype) amount of space in the message envelope. 



1 Introduction 

The Message Passing Interface (MPI) |31 12] provides a standard and portable 
way of communicating data from one process to another, even for heteroge- 
neous collections of computers. A key part of MPI’s support for moving data 
is the description of data not as a series of undifferentiated bytes but as typed 
data corresponding to the datatypes natural to the programming language being 
used with MPI. Thus, when sending C ints, the programmer specifies that the 
message is made up of type MPI_INT (because MPI is a library rather than a 
language extension, MPI cannot use the same names for the types as the pro- 
gramming language). MPI further requires that the type of the data sent match 
the type of the data received; that is, if the user sends MPI_INTs, the user must 
receive MPI_INTs.¹ MPI also allows the definition of new MPI datatypes, called 
derived types, by combining datatypes with routines such as MPI_TYPE_VECTOR, 
MPI_TYPE_STRUCT, and MPI_TYPE_HINDEXED. Because the matching of basic types 
is required for a correct program, a high-quality development environment should 
detect when the user violates this rule. This paper describes an efficient method 
for checking that datatype signatures match in MPI communication. 

* This work was supported by the Mathematical, Information, and Computational 
Sciences Division subprogram of the Office of Advanced Scientific Computing, U.S. 
Department of Energy, under Contract W-31-109-Eng-38. 

One reason such error checking is important for MPI programs is that MPI 
allows messages containing collections of different datatypes to be communicated 
in a single message. Further, the sender and receiver are often in different parts 
of the program, possibly in different routines (or even programs). User errors 
in the use of MPI datatypes are thus difficult to find; adding this information 
can catch errors (such as using the same message tag for two different kinds of 
messages) that are difficult for the user to identify by looking at the code. 

An additional complexity is that MPI requires only that the basic types of 
the data communicated match; for example, that ints match ints and chars 
match chars. This ordered set of basic datatypes (i.e., types that correspond to 
basic types supported by the programming language) is called the type signature. 
The type signature is a tuple of the basic MPI datatypes. For example, three 
ints followed by a double is 

(MPI_INT, MPI_INT, MPI_INT, MPI_DOUBLE). 

A type signature has as many types as there are elements in the message. This 
makes it impractical to send the type signature with the message. 

MPI also defines a type map; for each datatype, a displacement in memory 
is given. While the type map specifies both what and where data is moved, 
a type signature specifies only what is moved. Only the signatures need to 
match; this allows scatter/gather-like operations in MPI communication. For 
example, it is legal to send 10 MPI_INTs but receive a single vector (created with 
MPI_TYPE_VECTOR) that contains at least 10 MPI_INTs. Communicating with 
different type maps is legal as long as the type signatures are the same. Thus, it 
isn’t correct to check that the datatypes match; only the type signatures must 
match. 

Note that when looking at the type signature, the comparison is made with 
the basic types, even if the type was defined using a combination of derived 
datatypes. Thus, when looking at the type signature, any consecutive subse- 
quence may have come from a derived datatype. 

Consider the derived type t2 defined by the following MPI code fragment: 



¹ Two exceptions to this rule are mentioned in Section 4. A third, mentioned in the MPI 
standard, is for the MPI implementation to cast the type; for example, if MPI_INT 
is sent but MPI_FLOAT is specified for the receive, an implementation is permitted 
to convert the integer to a float, following the rules of the language. As this is not 
required, it is nonportable. Further, no MPI implementation performs this 
conversion, and because it silently corrects for what is more likely a programming error, 
no implementation is ever likely to implement this choice. 
    MPI_Datatype t1, t2, types[2]; 
    int          blen[2]; 
    MPI_Aint     displ[2]; 

    types[0] = MPI_INT;     types[1] = MPI_DOUBLE; 
    blen[0]  = 1;           blen[1]  = 1; 
    displ[0] = ...;         displ[1] = ...; 
    MPI_Type_struct( 2, blen, displ, types, &t1 ); 
    types[0] = t1;          types[1] = MPI_SHORT; 
    blen[0]  = 2; 
    MPI_Type_struct( 2, blen, displ, types, &t2 ); 




The derived type t2 has the type signature 

( (MPI_INT, MPI_DOUBLE), (MPI_INT, MPI_DOUBLE), MPI_SHORT ) = 

( MPI_INT, MPI_DOUBLE, MPI_INT, MPI_DOUBLE, MPI_SHORT ). 
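
The flattening step can be sketched in a few lines. This is an illustration only, under the assumption that a derived type is modelled as a list of (block length, member) pairs and a basic type as a string; the function name `signature` is hypothetical and displacements are ignored because they belong to the type map, not the type signature.

```python
def signature(datatype):
    """Flatten a datatype description into its tuple of basic types."""
    if isinstance(datatype, str):        # basic type, e.g. "MPI_INT"
        return (datatype,)
    flat = []
    for blen, member in datatype:        # derived type: (block length, member)
        flat.extend(signature(member) * blen)
    return tuple(flat)


# The two struct types from the fragment above, without displacements.
t1 = [(1, "MPI_INT"), (1, "MPI_DOUBLE")]
t2 = [(2, t1), (1, "MPI_SHORT")]
print(signature(t2))

# Ten MPI_INTs and a single vector holding ten MPI_INTs share a signature.
vector = [(10, "MPI_INT")]
print(signature(vector) == ("MPI_INT",) * 10)
```

The nesting of t1 inside t2 disappears in the result, which is why any consecutive subsequence of a signature may have come from a derived datatype.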

The approach in this paper is to define a hashing function that maps the 
type signature to an integer tuple (the reason for the tuple is discussed in 
Section 3). The communication requirement is thus bounded independent of the 
complexity of the datatype; further, the function is chosen so that it can be 
computed efficiently; finally, in most cases, the cost of computing and checking 
the datatype signature is a small constant cost for each communication 
operation. Since this approach is a many-to-one mapping, it can fail to detect an error. 
However, the mapping is chosen so that it never erroneously reports failure. 
Further, for the important special case of communication with basic datatypes (e.g., 
MPI_DOUBLE), the test succeeds if and only if the type signatures match. 

Other approaches are possible. The datatype definitions (just enough to re- 
produce the signature, not the type map) could be sent, allowing sender and 
receiver to agree on the datatypes. The definitions could be cached, allowing 
a datatype to be reused without resending its definition. The special case of 
(count, datatype) would reduce the amount of data that needed to be communi- 
cated in many common cases. Still, comparison of different datatypes in general 
would be complex, even if common patterns were exploited. Another approach 
is to send the complete type signature; this is the only approach that will catch 
all failures (various compression schemes can be used to reduce the amount of 
data that must be sent to describe the type signature, of course). Such an ap- 
proach could be implemented over MPI by using the MPI-2 routines to extract 
datatype definitions, along with the MPI profiling interface. For systems with 
some kind of globally accessible memory, such as the Cray T3D, it is possible 
to make all datatype definitions visible to all processes, as in [?]. The approach 
described in this paper offers several advantages. Perhaps most important, it is 
simple, requiring very little change to an MPI implementation. Sending the en- 
tire datatype signature, even if compressed, requires the MPI implementation to 
handle variable-length header data. In addition, even with compression, sending 
the full datatype signature can significantly increase the time to send a message; 
even in debugging mode, users prefer minimal extra overhead. 
2 Datatype Hashing Function 

We are looking for a function $f$ that converts a type signature into a small 
bit range, such as a single integer or pair of integers. The cost of evaluating 
$f$ should be relatively small; in particular, the cost of evaluating $f$ for a type 
signature containing $n$ copies of the same type (derived or basic) should be $o(n)$; 
for example, $\log n$. Because a type signature may contain an arbitrary number 
of terms, the easiest way to define $f$ is by a binary operation applied to all of 
the elements of the type signature. That is, define a binary operation $\otimes$ that can 
be applied to a type signature $(\alpha_1, \ldots, \alpha_n)$ as follows: 

$$f(\alpha_1) = \alpha_1$$ 

$$f((\alpha_1, \alpha_2, \ldots, \alpha_n)) = \bigotimes_{i=1}^{n} \alpha_i.$$ 

For example, 

$$f(\text{int}, \text{double}) = (\text{int}) \otimes (\text{double})$$ 

and 

$$f(\text{int}, \text{double}, \text{char}) = (\text{int}) \otimes (\text{double}) \otimes (\text{char}).$$ 

In order to make it inexpensive to compute the hash function for datatypes 
built from an arbitrary combination of derived datatypes, the hash function must 
be associative. Since we want (int, double) to hash to a different value from 
(double, int), we want the operation $\otimes$ to be noncommutative. For this 
approach to be useful, the hash function must hash different datatypes to different 
hash values, particularly in the case of “common” errors, such as mismatched 
predefined datatypes. 

3 A Simple Datatype Hashing Function 

We need an operation that is both associative and noncommutative. Our 
approach is to define a tuple $(\alpha, n)$ where $\alpha$ is a datatype (derived or basic) and $n$ 
is the number of basic datatypes in $\alpha$. We start with the predefined datatypes, 
representing, for example, MPI_INT as $(\alpha_{\text{int}}, 1)$, where $\alpha_{\text{int}}$ is an integer value. 
The tuple for a derived datatype is then constructed by applying the operator 
$\otimes$, whose action is given by 

$$(\alpha, n) \otimes (\beta, m) = (\alpha \oplus (\beta \ll n),\; n + m),$$ 

where the operators $\oplus$ and $\ll$ are chosen to have the following properties: 

$$(\alpha \ll n) \ll m = \alpha \ll (n + m) \tag{1}$$ 

$$(\alpha \oplus \beta) \oplus \gamma = \alpha \oplus (\beta \oplus \gamma) \tag{2}$$ 

$$(\alpha \ll n) \oplus (\beta \ll n) = (\alpha \oplus \beta) \ll n. \tag{3}$$ 
One choice for these operators is bitwise exclusive or (xor) for $\oplus$ and circular 
left shift for $\ll$. These operations are often chosen for hash functions because 
they are very cheap to apply. They have the necessary properties, as can be 
proven by writing $\alpha$ and so forth as bit vectors and then applying the 
operations xor and circular shift to those bit vectors. Another choice of operators 
is integer addition modulo 2^^ for $\oplus$ and circular left shift by 3 for $\ll$ (that is, 
$\alpha \ll 1$ is $\alpha$ shifted left three bits). 

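These properties allow us to prove that the operation $\otimes$ is associative: 

$$\begin{aligned}
((\alpha, n) \otimes (\beta, m)) \otimes (\gamma, p)
 &= (\alpha \oplus (\beta \ll n),\; n + m) \otimes (\gamma, p) \\
 &= (\alpha \oplus (\beta \ll n) \oplus (\gamma \ll (n + m)),\; n + m + p) \\
 &= (\alpha \oplus ((\beta \oplus (\gamma \ll m)) \ll n),\; n + (m + p)) \\
 &= (\alpha, n) \otimes (\beta \oplus (\gamma \ll m),\; m + p) \\
 &= (\alpha, n) \otimes ((\beta, m) \otimes (\gamma, p)).
\end{aligned}$$ 

The operation $\otimes$ is not commutative: 

$$(\alpha, n) \otimes (\beta, m) = (\alpha \oplus (\beta \ll n),\; n + m)$$ 

$$(\beta, m) \otimes (\alpha, n) = (\beta \oplus (\alpha \ll m),\; n + m),$$ 

but 

$$\alpha \oplus (\beta \ll n) \ne \beta \oplus (\alpha \ll m)$$ 

except in special cases. 

Note that addition and xor by themselves are commutative; it is the shift operation 
that makes $\otimes$ noncommutative. 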

We will use this operation to build $f$. Specifically, we will apply $\otimes$ to a type 
signature where we have replaced every basic type with a tuple containing an 
integer representing the type and a one, indicating a single basic type. That is, 

$$(\text{int}, \text{double}, \text{char})$$ 

becomes 

$$((\text{int}, 1), (\text{double}, 1), (\text{char}, 1))$$ 

and 

$$f((\text{int}, \text{double}, \text{char})) = (\text{int}, 1) \otimes (\text{double}, 1) \otimes (\text{char}, 1).$$ 
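
A minimal sketch of this construction follows, using xor and circular left shift, for which the three required properties hold bit by bit. The 32-bit word size, the names `rotl`, `combine`, and `f`, and the integer codes assigned to int, double, and char are assumptions made for illustration; they are not values from the paper or from any MPI implementation.

```python
WORD = 32                       # assumed word size in bits
MASK = (1 << WORD) - 1


def rotl(value, n):
    """Circular left shift of a WORD-bit value by n positions."""
    n %= WORD
    return ((value << n) | (value >> (WORD - n))) & MASK


def combine(x, y):
    """Tuple operation: (a, n) and (b, m) give (a xor (b rotated by n), n + m)."""
    (a, n), (b, m) = x, y
    return (a ^ rotl(b, n), n + m)


# Hypothetical integer codes for three basic types.
CODES = {"int": 0x1234, "double": 0x5678, "char": 0x9ABC}


def f(sig):
    """Hash a non-empty type signature given as a sequence of basic type names."""
    result = (CODES[sig[0]], 1)
    for name in sig[1:]:
        result = combine(result, (CODES[name], 1))
    return result


i, d, c = (CODES["int"], 1), (CODES["double"], 1), (CODES["char"], 1)
print(combine(combine(i, d), c) == combine(i, combine(d, c)))   # associativity
print(f(("int", "double")) != f(("double", "int")))             # order matters
```

The second component of the tuple is what lets a derived type be combined as a unit: it records how far the next operand must be rotated.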



3.1 Cost of Evaluating $f$ 

Several identities can be used to reduce the cost of computing $f$. One important 
case is a type signature containing a large number of the same basic type. This 
is the signature that represents the most common MPI usage: a send with a 
basic datatype and a count that is greater than one. Using a method that is 
very similar to the approach for evaluating integer powers of matrices, we can 
compute $\bigotimes_{i=1}^{m} (\alpha, n)$ in $O(\log m)$ time by induction. Let $m$ be $2^k$ for some $k$. 
Then 

$$\bigotimes_{i=1}^{m} (\alpha, n) = \left( \bigotimes_{i=1}^{m/2} (\alpha, n) \right) \otimes \left( \bigotimes_{i=1}^{m/2} (\alpha, n) \right);$$ 

the terms on the right are evaluated by induction. This can be evaluated with 
$\log_2 m$ evaluations. The generalization to arbitrary $m$ is left to the reader. 

Further, note that $v \ll n = v \ll (n + \text{wordsize}) = v \ll (n \bmod \text{wordsize})$; 
this can be used to reduce the cost of evaluating $f$. 

Finally, by exploiting the associative property of $\otimes$, evaluating $f$ for a new 
derived datatype involves only the values of $f$ for the datatypes that make up 
the new datatype (with the exception of those containing types MPI_PACKED or 
MPI_BYTE; see Section 4). Thus, computing $f$ for a datatype has cost proportional 
only to the number of different datatypes (either user-defined or basic) used in 
the definition and proportional to the log of the number of instances of each 
datatype. 
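
One way to carry out the generalization to arbitrary m is binary exponentiation, sketched below. This is an illustrative reading rather than the paper's code: the xor/rotate operators, the 32-bit word size, and the name `repeat` are assumptions. Correctness relies only on associativity, since every factor is a power of the same tuple.

```python
WORD = 32                       # assumed word size in bits
MASK = (1 << WORD) - 1


def rotl(value, n):
    """Circular left shift; the shift count is reduced modulo the word size."""
    n %= WORD
    return ((value << n) | (value >> (WORD - n))) & MASK


def combine(x, y):
    (a, n), (b, m) = x, y
    return (a ^ rotl(b, n), n + m)


def repeat(h, m):
    """Combine m >= 1 copies of the tuple h using O(log m) combine steps."""
    result = None
    power = h                   # holds h combined 2**k times
    while m > 0:
        if m & 1:
            result = power if result is None else combine(result, power)
        power = combine(power, power)
        m >>= 1
    return result


def repeat_naive(h, m):
    """Reference version: m - 1 sequential combine steps."""
    result = h
    for _ in range(m - 1):
        result = combine(result, h)
    return result


h = (0x1234, 1)                 # hypothetical code for a basic type
print(repeat(h, 1000) == repeat_naive(h, 1000))
```

For a count of 1000 the doubling loop runs ten times, compared with 999 steps for the sequential version.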




3.2 Hash Function Quality 

For the hash function to be useful, collisions should be rare. Since in a typical 
program, MPI type signatures are not randomly distributed, it makes the 
most sense to experimentally evaluate some common datatype patterns. 
Further, while there are 13 distinct basic MPI datatypes in the C binding, most 
programs use only a few types, such as MPI_INT and MPI_DOUBLE. Types such 
as MPI_UNSIGNED_CHAR are rarely used. Thus, for most applications, only a few 
basic datatypes will appear. To see how likely a collision in the hash function 
might be, we tested the following patterns: 

$$n : \alpha_i \tag{4}$$ 

$$m : (1 : \alpha_i,\; (n - 1) : \alpha_j) \tag{5}$$ 

$$1 : \alpha_i,\; m : (1 : \alpha_i,\; (n - 1) : \alpha_j), \tag{6}$$ 

where $n : x$ means $n$ copies of $x$. These correspond to the cases of count ($n$) of 
a basic datatype (4), count $m$ of a structure containing $n$ members (5), and a 
structure containing count $m$ of another structure (6). Various values of $n$ and 
$m$ were used. 

Table 1 shows the results of the tests. Clearly, only the choice of integer 
addition with medium-sized integers provides an effective hash function; with 
this choice, only about one in one hundred different type signatures (1.2%) hashed 
to the same value as another. Further experiments may identify improved hash functions. 



166 William D. Gropp 



Table 1. Results of tests of the hash function. Collisions is the percentage of 
type signatures whose hash value was the same as a different type signature. 
Duplicates gives the percentage of hash values that were duplicated. Operand 
indicates whether the representation for a basic datatype is a small integer (less 
than 32) or a larger integer (less than 2^®). We tested 4625 different type signa- 
tures. 



Operator1  Operator2  Operand  Collisions (%)  Duplicates (%) 
xor        rotate 1   small    57.4            13.4 
xor        rotate 3   small    48.9            10.5 
+          rotate 1   small    24.9            11.5 
+          rotate 3   small    29.4            10.3 
xor        rotate 1   medium   45.6             9.8 
+          rotate 1   medium    1.2             0.58 


3.3 Improving the Type Signature Test 

One modification of the approach is to optimize for the special case of count 
copies of a datatype (basic or otherwise), since this is the fundamental unit in 
MPI (all MPI communication operations send count copies of a given datatype). 

In this case, we send $(\text{count}, \alpha, n)$. The modified test is shown in Figure 1. 
Note that the count applied in the receive case is the actual count, not the 
maximum count that is provided by the user in the MPI_RECV call. In addition, we 
do not need to send the count separately; we can simply use a single bit to 
indicate that the datatype is basic and the count can be computed, if necessary, 
from the length of data sent. With this modification, basic datatypes are handled 
exactly (all errors are detected). 



if ($\alpha_{send} \ne \alpha_{recv}$) then 
    if ($\alpha_{send}$ and $\alpha_{recv}$ are basic) then error 
    else if ($\bigotimes^{count_{send}} (\alpha_{send}, n_{send}) \ne \bigotimes^{count_{recv}} (\alpha_{recv}, n_{recv})$) then error 
    endif 
endif 



Fig. 1. Modification to test to provide exact handling of the most common case. 
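
The test in Figure 1 can be written out as follows. This is a sketch under assumptions: each side is modelled as a tuple (count, alpha, n, is_basic), where the is_basic flag stands in for the single bit mentioned in the text; the xor/rotate operators, the 32-bit word size, the integer codes, and the names `check` and `repeat` are all hypothetical.

```python
WORD = 32                       # assumed word size in bits
MASK = (1 << WORD) - 1


def rotl(value, n):
    n %= WORD
    return ((value << n) | (value >> (WORD - n))) & MASK


def combine(x, y):
    (a, n), (b, m) = x, y
    return (a ^ rotl(b, n), n + m)


def repeat(h, count):
    """Combine count >= 1 copies of h (sequential version, for clarity)."""
    result = h
    for _ in range(count - 1):
        result = combine(result, h)
    return result


def check(send, recv):
    """Return True if no type signature error is detected."""
    count_s, alpha_s, n_s, basic_s = send
    count_r, alpha_r, n_r, basic_r = recv
    if alpha_s != alpha_r:
        if basic_s and basic_r:
            return False        # two different basic types: a certain error
        if repeat((alpha_s, n_s), count_s) != repeat((alpha_r, n_r), count_r):
            return False
    return True


INT, DOUBLE = 0x1234, 0x5678    # hypothetical codes for basic types
ten_ints = (10, INT, 1, True)
vec_alpha, vec_n = repeat((INT, 1), 10)      # a vector of ten ints
one_vector = (1, vec_alpha, vec_n, False)

print(check(ten_ints, one_vector))                   # legal: same signature
print(check(ten_ints, (10, DOUBLE, 1, True)))        # mismatch is reported
```

The first branch never consults the hash, which is why mismatched basic datatypes are always detected.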



4 Limitations 

MPI allows users to send partial datatypes. That is, the user can define a 
datatype representing, for example, an int followed by ten doubles, and 
receive this into a datatype of an int followed by fifty doubles, as long as the 
type signature of the data that is sent matches the type signature at the receiver 
for all of the types that are used. This allows the user to define a maximum-sized 
datatype on the receive end but an actual sized datatype on the send end. 

In MPI, the user can detect this by examining the MPI_Status value returned 
by the receive. If the routine MPI_GET_COUNT returns MPI_UNDEFINED, then the 
routine MPI_GET_ELEMENTS may be used to determine how many elementary 
(predefined) MPI datatypes were sent. In the case above, MPI_GET_ELEMENTS would 
return eleven (one int plus ten doubles). Our test does not handle this. Thus, 
it must also test for MPI_GET_COUNT being MPI_UNDEFINED; in that case, the test 
passes (even if the type signatures do not, in fact, match). In principle, a 
corresponding value of $f$ could be constructed by using the same process that is 
used in an MPI implementation to evaluate MPI_GET_ELEMENTS; by integrating 
the computation of $f$ with this routine, this test can be performed with low 
additional cost. 

The MPI datatypes MPI_PACKED and MPI_BYTE also present special problems 
whose full discussion would take too long. In short, data sent with MPI_PACKED is 
first packed incrementally into a user-defined buffer using the routine MPI_PACK. 
The function $f$ must thus also be accumulated incrementally; one possibility is 
to use the header of the packed buffer. The more complex case of MPI derived 
datatypes that contain MPI_PACKED can also be handled, though here the 
function $f$ must be evaluated when the data is sent rather than when the datatype 
is created. MPI_BYTE explicitly turns off type signature matching and is best 
handled with a reserved hash value (e.g., 0xFFFFFFFF,-1). 

5 Conclusion 

We have shown an efficient way to catch many user errors caused by type 
signature mismatch at run time in MPI programs. The cost is relatively small, 
consuming only an additional 32 to 64 bits (4 to 8 bytes) of message header and an 
evaluation cost that is bounded by $O(m \log n)$ for derived datatypes containing 
$m$ different types with repeat count $< n$. The most common cases (count of a 
basic datatype) take constant time. We note that this approach can be used for 
any system that incrementally packs and unpacks data, such as XDR or PVM. 
Abstract. We present a process management system for parallel 
programs such as those written using MPI. A primary goal of the system, 
which we call MPD (for multipurpose daemon), is to be scalable. By this 
we mean that startup of interactive parallel jobs comprising a thousand 
processes is quick, that signals can be quickly delivered to processes, and 
that stdin, stdout, and stderr are managed intuitively. Our primary 
target is parallel machines made up of clusters of SMPs, but the system 
is also useful in more tightly integrated environments. We describe how 
MPD enables much faster startup and better runtime management of 
MPICH jobs. We show how close control of stdio can support the easy 
implementation of a number of convenient system utilities, even a parallel 
debugger. MPD is implemented and freely distributed with MPICH. 



1 Introduction 

A parallel programming environment may be viewed as comprising three inter- 
acting components: a job scheduler, which decides what resources a parallel job 
consisting of multiple processes will run on; a process manager, which starts and 
terminates processes and provides them with a number of services; and a parallel 
library such as MPI, which a parallel application calls upon for communications. 
Since these components need to communicate with one another, they are often 
integrated into a single system. An important research question is to what ex- 
tent they can be separated from one another with well-defined interfaces so that 
they can be independently developed. A further research question is whether the 
resulting system can be made scalable to jobs involving thousands of commu- 
nicating processes. In this paper we focus on the process manager component. 
We describe a design and an implementation we call MPD (for multipurpose 
daemon) that provides both fast startup of parallel jobs and a flexible run-time 
environment that supports parallel libraries. 

In Section 2 we summarize related work. In Section 3 we state our explicit 
design goals, how these goals lead to implementation decisions, and interesting 
features of the resulting system, including how it can be used to create a 
parallel debugger out of an existing single-process debugger. Section 4 summarizes 
preliminary experiments that make us optimistic about the usefulness of MPD 
as a process manager for large-scale systems. We conclude with a summary of 
progress to date and a description of our future plans. 

* This work was supported by the Mathematical, Information, and Computational 
Sciences Division subprogram of the Office of Advanced Scientific Computing Research, 
U.S. Department of Energy, under Contract W-31-109-Eng-38. 

The MPD system is in use and is available as open source as part of the 
MPICH system, obtainable from http://www.mcs.anl.gov/mpi/mpich. 



2 Related Work 

All parallel computing environments that support execution of truly parallel 
programs (those in which any two processes can communicate with one another) 
have had to address at least some of the issues that we address with MPD. 
Parallel programming systems, such as PVM [?], P4 [?], and implementations 
of MPI such as MPICH [?] and LAM [?] all provide some mechanism for starting 
and running parallel programs, often with a specialized daemon process. 

Many systems are intended to manage a collection of computing resources 
for both single-process and parallel jobs; see the survey by Baker et al. [3]. 
Typically, these use a daemon that manages individual processes, with emphasis 
on jobs involving only a single process. Widely used systems include PBS [?], 
LSF [?], DQS [?], and Loadleveler/POE [?]. The Condor system [?] is also 
widely used and supports parallel programs that use PVM [?]. Other, more 
specialized systems, such as MOSIX [?] and GLUnix [?], provide a form of 
single-system image support for clusters. 

Harness [?] shares with MPD the goal of supporting management of 
parallel jobs. Its primary research goal is to demonstrate the flexibility of the 
“plug-in” approach to application design, providing a wide range of services, whereas 
the MPD system focuses more specifically on the design and implementation of 
services required for process management of parallel jobs, including high-speed 
startup of large parallel jobs on clusters and scalable standard I/O management. 
The book [2] provides a good overview of metacomputing systems and issues. 



3 Design of MPD 

In this section we describe our goals in constructing MPD and outline the sys- 
tem’s architecture. 



3.1 Goals 

Several explicit goals have governed the design of the MPD system. 

Simplicity The persistent (across jobs) part of the system should be simple 
and robust. In the long run we expect this part to be runnable as root. If 
its behavior isn’t completely transparent, we will never be able to convince 
system administrators to run the daemons as root. 
Speed Startup of parallel jobs should be quick enough to provide an interactive 
“feel,” so that large but short jobs make sense. Large (in number of processes) 
but short (in time) characterizes system utilities such as those described 
in [?]. Our immediate target is to start 1000 processes in a few seconds, 
while still providing a way for such processes to establish contact with one 
another. Our long-term goal is to support management of 10,000 processes. 

Robustness The persistent part of the system should be at least moderately 
fault tolerant. An unexpected crash of one machine should not bring down 
the whole system. There should be no single “master” process. 

Scalability The complexity or size of any component should not depend on the 
number of components. 

Individual Process Environments It should be possible to start a parallel 
job in which the executable files, environment variables, and command-line 
arguments are different for each process. It should be possible to collect 
return codes individually from processes. 

Collective Identity of a Parallel Job It should be possible to treat a par- 
allel job as a single entity that can be suspended, continued (signaled, in 
general), or killed collectively as if it were a single process. The system 
should manage stdin, stdout, and stderr in a useful and scalable way and 
allow them to be redirected as if the parallel job were a single process. An 
important component of a job’s collective identity is its termination. All 
resources allocated for the job, such as files, System V IPCs, other processes, 
etc., must be reliably freed, even if the job terminates abnormally. 

It is explicitly not a goal of the MPD system to provide scheduling services, 
which we believe to be a separate function from process management. 

3.2 Deriving the Design from the Goals 

The goals of simplicity and robustness lead us to adopt a multicomponent 
system. The daemon itself is persistent (may run for weeks or months at a time, 
starting many jobs), typically one instance per host in a TCP-connected 
network. Manager processes will be started by the daemons to control the 
application processes (clients) of a single parallel job and will provide most of the MPD 
features. The goal of speed requires that the daemons be in contact with one 
another prior to job startup, and the goals of scalability and “no master” suggest 
that the daemons be connected in a ring.¹ The services that the managers will 
provide (see Section 3.3) suggest that they be in contact as well, and the fastest 
way for them to form these connections is to inherit part of the ring connectivity 
of the daemons. Separate managers for each user process support the individual 
process environments. The goal of having a collective identity for a parallel job 
leads us to treat the mpirun or mpiexec process as such a representative, and 
use it to deliver signals and stdin to application processes and collect stdout 
and stderr output from them. This suggests that the mpirun process connect 
first to the daemon ring in order to start the job, and then switch the connection 
to the manager ring in order to control the job. The goal of speed suggests that 
these latter connections be restricted to a process running on the same host, 
either the daemon itself or a persistent gateway process if the daemon is run as 
root, so that authentication can be through the file system (a Unix socket rather 
than a network socket). We refer to all such processes as console commands. Finally, 
in order that this infrastructure be available to support MPI programs or other 
parallel tools, there needs to be a client library that each application process may 
use to interact with its manager. 

¹ While a ring is not ultimately scalable, it is more so than the typical star used in 
many process management systems, and our experiments have shown it feasible for 
the 1000-daemon domain. 

We do not specify how the daemons are started or connected, since the system 
provides a number of alternatives, and the process need not be particularly fast. 
A console command is started by the user, either interactively or under the 
control of a batch scheduler. The daemons fork and exec the managers, which 
use information given them by the daemons to connect themselves into a ring, 
then fork and exec the clients. The startup messages traverse the ring quickly, 
so most forking, execing, and connecting take place in parallel, leading to fast 
startup even for large jobs. The situation is then as shown in Figure 1, where the 




Fig. 1. Daemons with console process, managers, and clients 



clients may be application MPI processes. Solid lines represent sockets, except for 
the vertical ones, which represent pipes. The dashed lines represent the trees of 
connections for forwarding stdout and stderr, and the dotted lines represent 
potential connections among the client processes. The dot-dashed line is the 
original connection from console to local daemon on a Unix socket, which is 
replaced during startup by the network connection to the first manager. 
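
The way a startup request might travel around the daemon ring can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the placement rule (one rank per daemon visited, wrapping around the ring), the function name `place_processes`, and the host names are hypothetical, and the actual MPD startup protocol is not specified in this section.

```python
def place_processes(daemons, start, nprocs):
    """Walk the ring from `start`, assigning one rank per daemon visited.

    daemons: host names in ring order (the last entry connects to the first).
    start:   the daemon local to the console command.
    """
    placement = []
    index = daemons.index(start)
    for rank in range(nprocs):
        placement.append((rank, daemons[index]))
        # Forward the remaining request to the next daemon in the ring.
        index = (index + 1) % len(daemons)
    return placement


ring = ["host0", "host1", "host2"]
for rank, host in place_processes(ring, "host1", 4):
    print(rank, host)
```

In this model each daemon can start its manager as soon as it has forwarded the request, which is the property the text credits for fast startup.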

3.3 Interesting Features 

Space restrictions prevent a complete description of all the features and capa- 
bilities of the MPD system, but in this section we mention a few highlights. 

Security Whenever a process advertises a “listener” socket and accepts connec- 
tions on it, the possibility exists that an unknown or even malicious process 
will connect. This is particularly dangerous if the process accepting the 
connection can start processes, as the MPD daemon can. We currently use the 
“challenge-response” system described in [?]. In the long run, we expect to 
modify this component of the system to use more elaborate schemes and ex- 
tend them to other connections such as client/gateway authentication. This 
will have little impact on the job startup speed since the daemon component 
startup is separate from job startup. 

Fault Tolerance If a daemon dies, this fact is detected and the ring is reknit. 
This provides a minimal sort of fault tolerance, since the ring remains intact. 
A new MPD daemon can be inserted in the ring where the old one was, but 
this process is not (yet) automatic. 

Signals Signals can be delivered to client processes by their managers. We cur- 
rently use this capability in two specific ways. First, signals delivered to a 
console process are propagated to the clients, so that a parallel application as 
a whole can be suspended with cntl-Z, continued, and killed with cntl-C, 
just as if it were a single process. Second, in the ch_p4mpd device in the 
MPICH implementation of MPI, client processes can interrupt one another 
with requests to dynamically establish client-to-client connections. Such re- 
quests go up into the manager ring from the originating client, around the 
ring to the manager of the target process, which signals its client. 

Support for MPI Implementations Currently MPD provides direct support 
for the MPICH implementation of MPI. The ch_p4mpd device distributed 
with Version 1.2 of MPICH makes direct calls to the client library compo- 
nent of the MPD system to find out a process’s rank, where other processes 
are and how to contact them, and so forth. In our next major release of 
MPICH, the support will be indirect, through a general parallel-library-to- 
process-manager interface we will describe elsewhere. 

On clusters of SMPs, it is easy to specify that multiple processes are to 
be started on the same machine and share memory. Specifically, mpirun -np 
180 -g 2 cpi starts processes in groups of two and places in their environ- 
ment a key that can be used to acquire group-attached shared memory and 
other information needed to set up multimethod communication for an MPI 
implementation. Other communication mechanisms (such as VIA) will be 
supported over time. 

Handling Standard I/O Managers capture the stdout and stderr of their
clients, and forward them up a pair of binary trees of socket connections, 
each manager merging stdout and stderr from its client with that from each 
of its two children. A command line option tells the managers to provide a 
rank label on each line of output from their clients. 
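
The merging scheme just described can be sketched as follows. This is a minimal model and not MPD code: the tree layout (the children of manager i are managers 2i+1 and 2i+2), the function name collect_output, and the label format are assumptions made for illustration, and a real manager would interleave lines as they arrive rather than concatenate complete lists.

```python
def collect_output(rank, client_lines, label=True):
    """Return the lines that manager `rank` forwards to its parent.

    client_lines[r] is the list of stdout lines produced by the client of
    manager r. The layout (children 2*rank+1 and 2*rank+2) is an assumption.
    """
    nprocs = len(client_lines)
    own = client_lines[rank]
    if label:
        # Optional rank label on each line, as selected by a command line option.
        own = [f"{rank}: {line}" for line in own]
    merged = list(own)
    for child in (2 * rank + 1, 2 * rank + 2):
        if child < nprocs:
            # Merge the stream coming up from each of the (at most) two children.
            merged.extend(collect_output(child, client_lines, label))
    return merged


if __name__ == "__main__":
    lines = [["host0"], ["host1"], ["host2"], ["host3"]]
    for line in collect_output(0, lines):
        print(line)
```

Each manager handles at most three streams, so the depth of the tree, and not the number of processes, bounds the number of forwarding steps a line takes to reach the console.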

Standard input (to mpirun, for example) by default is delivered to the 
client managed by manager 0. This seems to be what most MPI users ex- 
pect, and what most MPI implementations do. (The MPI standard does not 
specify.) However, control messages can be used to change this behavior to 
direct stdin to any specific client or broadcast it to all clients. 

Client Wrapping The semantics of the Unix fork and exec system calls pro- 
vide useful benefits. When a manager forks a client process, for example, it 
first sets up the manager-client pipes for control messages and standard I/O.
The “lower” ends of these pipes are inherited by any process that the client 
forks. Thus, even though the client is not using any of the client library, man- 
agers can manage clients that themselves run the “real” application process. 
We call this scheme client wrapping. Thus mpirun -np 16 nice -5 myprog 
lowers the priority of a parallel job to be run on one’s colleagues’ worksta- 
tions, and mpirun -np 16 pty myprog can be used when myprog needs to 
be attached to a terminal (otherwise our capture of stdin and stdout modi- 
fies their buffering behavior). (The program pty is distributed with the MPD 
system.) 

Putting It All Together The combination of I/O management, especially 
redirection of stdin, line labels on stdout, and client wrapping can be sur- 
prisingly powerful. We have used these features of the MPD system to add 
an option to mpirun that invokes gdb as a client wrapper and dynamically 
redirects stdin. While mpirun -np 3 cpi runs cpi directly as an MPI job, 
mpirun -np 3 -d cpi runs each cpi process under the control of (wrapped 
by) the gdb debugger. (Other sequential debuggers could be used, but are 
not yet supported.) Thus multiple instances of gdb are being run. Output from
the gdb instances is labeled by process rank. The “(gdb)” prompts are intercepted
by the mpirun process and counted, so that it can issue an “(mpigdb)” 
prompt when one has been received from each process. In addition, mpirun 
-d uses the “z” command (one of the few single letters not already claimed 
by gdb) to redirect stdin to a specific gdb instance or to all processes. Thus 
processes can be stepped and breakpoints can be set either collectively or 
individually, and collectively printing a variable will provide all values with 
rank labels. An example terminal session showing how this works can be 
seen at http://www.mcs.anl.gov/mpi/mpich/mpd/mpigdb.script. 
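
The prompt handling described above can be sketched as follows. This is an illustrative model only: the class name PromptCounter and its method are hypothetical, and the sketch assumes every process answers each command, which is the collective case described in the text.

```python
class PromptCounter:
    """Count "(gdb)" prompts and emit one "(mpigdb)" prompt per full round."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.waiting = set(range(nprocs))  # ranks whose prompt is outstanding

    def on_line(self, rank, text):
        """Handle one line of debugger output; return text to show, or None."""
        if text.strip() == "(gdb)":
            self.waiting.discard(rank)
            if not self.waiting:
                # A prompt has arrived from every process: start a new round.
                self.waiting = set(range(self.nprocs))
                return "(mpigdb) "
            return None  # swallow individual prompts
        return f"{rank}: {text}"  # ordinary output gets a rank label


if __name__ == "__main__":
    counter = PromptCounter(3)
    stream = [(0, "$1 = 5"), (0, "(gdb)"), (1, "$1 = 7"),
              (1, "(gdb)"), (2, "$1 = 9"), (2, "(gdb)")]
    for rank, text in stream:
        shown = counter.on_line(rank, text)
        if shown is not None:
            print(shown)
```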

4 Experiments 

Most development of MPD has been on workstation networks where startup of 
32-process jobs on five workstations is virtually instantaneous, compared with 
the approximately 1.5 seconds per process required by the ch_p4 version of 
MPICH. An early test of the feasibility of using the ring topology showed that 
a message could make 1024 hops around the ring in less than .4 seconds, which 
gave us confidence that the ring would not impose scalability limits, at least in 
the near term. Recently we began experiments on Chiba City, a testbed for par- 
allel computer science research jj. We performed one set of tests on 211 nodes 
connected by Fast Ethernet. We were interested only in process startup time, 
and so tested execution of trivial parallel jobs. Typical experiments included 

time mpirun -np 211 hostname 
time mpirun -np 422 -g 2 hostname 

We found that starting 211 processes (one on each node) and collecting the 
stdout output of hostname took about 2 seconds to execute. Starting twice as 
many processes (one for each CPU) took about 3.5 seconds, including setting up
the relatively complex stdout tree and collecting the output. Sending a message 
around the ring of 211 MPD daemons took only .13 seconds. More experiments 
are ongoing, and we will soon be able to report on MPI jobs on Chiba City. 
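
To put these measurements in perspective, the short calculation below converts them to per-process and per-hop costs: roughly 9.5 ms and 8.3 ms per process for the two startup tests, and roughly 0.6 ms per ring hop. The last quantity is only a rough indication and rests on an assumption: it extrapolates the 1.5 seconds per process quoted for ch_p4 on workstation networks linearly to 211 processes, which was not measured on Chiba City.

```python
# Figures quoted in the text.
procs_one_per_node, secs_one_per_node = 211, 2.0
procs_one_per_cpu, secs_one_per_cpu = 422, 3.5
ring_daemons, secs_around_ring = 211, 0.13
ch_p4_secs_per_process = 1.5  # workstation-network figure, not Chiba City

per_process_211 = secs_one_per_node / procs_one_per_node
per_process_422 = secs_one_per_cpu / procs_one_per_cpu
per_hop = secs_around_ring / ring_daemons

# Assumption: ch_p4 startup cost grows linearly with the number of processes.
ch_p4_extrapolated = ch_p4_secs_per_process * procs_one_per_node

print(f"MPD, 211 processes: {per_process_211 * 1000:.1f} ms per process")
print(f"MPD, 422 processes: {per_process_422 * 1000:.1f} ms per process")
print(f"Ring of 211 daemons: {per_hop * 1000:.2f} ms per hop")
print(f"ch_p4 extrapolated to 211 processes: {ch_p4_extrapolated:.1f} s")
```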

5 Future Development 

The existing MPD system, consisting of daemons, managers, console commands, 
and client library, meets our goals of simplicity, robustness, and scalability. It is 
used for fast startup of MPI jobs and others on systems with hundreds of ma- 
chines. The flexibility of its stdio control mechanism has provided unexpected 
benefits, such as a “poor man’s” parallel debugger. It meets our goals for the collective
identity of a parallel job. It does not yet meet all of our goals with respect
to individual process environments, although that is coming very soon. 

In the near term, we expect to use the system to implement the dynamic 
process creation part of MPI-2 in MPICH. The design presented here, with a 
simple daemon and a separate manager process providing most of the features 
needed by user jobs, allows the daemons to be run as root while the managers are 
run as user processes. We expect to begin running the daemon as root on some 
large-scale multi-user systems, in order to provide a persistent job management 
system. This will require increased attention to security issues as well as a precise 
definition of how MPD will interoperate with a full-featured scheduling system 
such as the Maui scheduler [2]. We believe that the MPD daemons can also begin
to provide more services, such as run-time performance monitoring. 

In the long run, as machines grow from hundreds to thousands of nodes, our 
rings of daemons and managers may have to grow into a more sophisticated 
structure, such as rings of rings, in order to continue to provide fast startup. We 
anticipate that this can be done without substantially changing the MPD design 
presented here. We will also need a more sophisticated output merger in order 
to provide scalable stdout, for example for large-scale parallel debugging. 

In summary, we are finding the MPD system already a useful contribution to 
one’s parallel programming environment and expect its applicability to expand 
in the near future. We also view its design as a valuable starting point for future 
research into large-scale parallel job execution environments. 

Abstract. Most MPI implementations cope with different underlying
means of communication. More than just providing the ability
to send a message through a certain protocol, the implementations make
use of specific features of a protocol to speed up message exchange.
These different communication protocols are integrated with each other 
and the MPI user does not and should not need to be concerned about it. 
However, when it comes to One Sided Communications this integration 
becomes more complicated. Some protocols, like TCP, do not lend them- 
selves to One Sided Communications, while others, like shared memory, 
are so similar that implementation is trivial. This paper describes the 
issues we came across when implementing One Sided Communications 
for an MPI implementation with multi pluggable protocols. 



1 Introduction 

MPI m has become the de facto standard for message passing in parallel computing.
Since its release in 1994 it has increasingly been adopted by both industry
and academia. Recently its features have been extended |2| with the release of 
the “MPI-2 standard” PI • Amongst its most relevant new features is a chapter 
on single sided communications, i.e. communication that can be done without 
requiring explicit calls from all processes involved. This single sided communi- 
cations chapter is a message passing approach to shared memory. 

Implementing single sided communications for a shared memory protocol is a 
simple task since it just requires mapping MPI’s calls to the underlying system’s 
equivalents U . Most vendors have single sided communications implemented only 
for shared memory systems. Implementing shared memory on other communication
protocols like TCP is not as simple, but it has been a topic of research pj
and there are even implementations of shared memory over traditional MPI PI .
In terms of MPI-2 single sided implementations for non shared memory systems 
the authors are only aware of two: 0 and 0. 

However, trying to create an implementation that can cope with both at the 
same time becomes rather complicated. This paper describes the issues involved 
in developing and implementing an integration for the pluggable protocols for 
single sided communications. This work uses the SUN implementation of MPI. 
SUN has recently made the source code for SUN-MPI available under a community
source licence 0. We begin by describing the motivations, followed by
an overview of the solution we implemented. After that we discuss the issues 
involved in this solution. Then some benchmark results are presented. Finally 
section 6 discusses conclusions, which include ongoing and future work.

2 Motivations 

In order to provide a complete implementation of MPI-2’s One Sided Communi- 
cations one must guarantee that an MPI call can be used by all the processes in 
MPI_COMM_WORLD. The current implementation from Sun only allows single sided
communications between processes that are connected through the shared mem- 
ory protocol. This is an obvious disadvantage and does not meet one of MPI’s 
goals: portability. 

Our implementation overcomes this problem and also integrates the different 
protocols, thus lifting all previous restrictions to use single sided communica- 
tions. 

Our aim was to implement a generic version of one-sided communications 
built on top of point to point communications. Ultimately this generic version 
should be capable of co-existing with optimised protocol specific implementations 
but provide a fall-back implementation for any protocol that does not implement
one-sided communication directly. 

One of the examples of the need to use single sided communications with 
protocols other than shared memory arises when one is intending to use clustered 
SMPs. Since shared memory cannot be used between the nodes of the cluster 
the MPI user program would have to be aware of which processes are running 
on each machine and cope with the fact that some groups of processes cannot 
use single sided communications between them. Situations like this are what the 
MPI standard proposes to overcome by making the implementor responsible for 
dealing with it. Thus it does not specify a way to obtain the information needed 
to allow the user program to cope with it. 

In the following section we present an overview of our approach to solving 
the problem. 

3 The Big Picture 

The diagram in figure 1 presents the generic single sided code within the layers
of the implementation of MPI we used. 

Our generic implementation uses any existing protocol. Whenever the pro- 
tocol has implemented one-sided functionality this will be used, otherwise our 
implementation will cover for it. This general purpose implementation’s main 
feature is an asynchronous agent (Request Agent) which handles the RMA re- 
quests. 

In this first implementation the agent runs in a thread concurrently with the 
users’ code and the normal MPI calls. 
Fig. 1. Single sided layers



The call to the protocol’s implementation is done at the highest level possible 
and the overhead of the integration has been kept to the minimum. 

Overcoming the differences between the protocols and providing a way to 
cope with possibly new protocols brought to light several issues that we discuss 
next. 



4 Issues 

This section discusses the main barriers we had to overcome when integrating the 
different types of protocols. Protocols are designed to provide a certain service. 
Thus in some cases the strongest feature of one protocol can be the weakest 
point of another. 

4.1 RMA 

The remote memory access (RMA) functions only ever utilise a single imple- 
mentation so no issues arise here. All that is required is to identify the correct 
implementation based on the rank of the target process. The synchronisation 
calls are the ones most likely to give rise to conflicts. 
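
The selection step can be sketched as follows. This is a minimal illustration, not the SUN-MPI code: the class names, the protocol labels "shm" and "tcp", and the function select_rma are hypothetical. The sketch only shows the rule stated above, namely that the protocol used to reach the target rank determines which single implementation handles the RMA call, with the generic point to point version as the fall-back.

```python
class SharedMemoryRMA:
    """Stand-in for a protocol that implements one-sided calls natively."""
    name = "native shared memory"


class GenericRMA:
    """Stand-in for the generic version built on point to point messages."""
    name = "generic point to point"


def select_rma(target_rank, protocol_of_rank, native_rma, generic_rma):
    """Pick the implementation that serves RMA calls aimed at target_rank."""
    protocol = protocol_of_rank[target_rank]
    # Use the protocol's own one-sided code if it has any, else fall back.
    return native_rma.get(protocol, generic_rma)


if __name__ == "__main__":
    protocol_of_rank = {0: "shm", 1: "shm", 2: "tcp", 3: "tcp"}
    native = {"shm": SharedMemoryRMA()}
    generic = GenericRMA()
    for rank in sorted(protocol_of_rank):
        impl = select_rma(rank, protocol_of_rank, native, generic)
        print(rank, impl.name)
```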



4.2 Fence 

Out of the three synchronisation calls only the fence is straightforward since it is 
essentially a barrier. MPI_Win_fence completes an exposure epoch and synchronises
the processors. Each implementation provides a separate function to complete
its own communications; synchronisation is via the normal MPI_Barrier.



4.3 Lock 

If one uses the shared memory protocol or some protocol that provides memory 
locking then MPI’s MPI_Win_lock matches the protocol and no extra complexity 
needs to be added. If the protocol is TCP or similar then some sort of an asynchronous
agent is required to provide the lock facility. This agent will process
all the requests from remote processes and request the system lock used by the 
shared memory protocol. 

Figure 2 shows the concurrency between local processes, using shared memory,
and remote processes using TCP.
Fig. 2. Lock synchronisation concurrency



Because the request agent is multiplexing requests the remote locks have 
lower priority relative to true shared memory locks. 

The best solution is to implement a lock shared across protocols. This lock 
allows fairness in the lock acquisition. However it is expected that it will add 
complexity and subsequently overhead to simple protocols like shared memory. 
This solution is still under development since the implementation has to cope 
with the existing protocols and any other protocol that might be added later. 

A simple and straightforward solution has been implemented for the time 
being, resulting in remote locks being at a disadvantage against the local ones. 

There are potential deadlock situations if the request agent blocks waiting 
for the shared memory lock. 

1. The request agent needs to continue to process requests if an exclusive lock 
request cannot be granted immediately. This is essential if some of the cur- 
rent shared locks were made via the Request Agent. 

2. The request agent must not attempt to acquire a shared lock if there is an 
exclusive lock that it made for a different client. 

Therefore the request agent should never block while waiting for an exclusive 
lock. Instead it should re-send the lock request (to itself). Once the exclusive 
lock has been acquired the request agent only listens to requests from the process 
that requested the lock. Once the lock is released the Request Agent reverts to 
listening for any request. 

Exclusive lock requests are acknowledged explicitly. Because these requests 
may be read and then re-queued, the originating process must wait for the acknowledgement
before sending any more requests. Otherwise these requests may be
processed before the lock is granted. 

Shared locks do not need to be acknowledged as the request (and any sub- 
sequent RMA requests) will never be received by the request agent while an 
exclusive lock is held. 
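
The non-blocking behaviour described in this section can be sketched as follows. This is a simplified single-threaded model and not the actual agent: the class names WindowLock and RequestAgent, the operation strings, and the use of a plain queue in place of MPI messages are all assumptions. It illustrates three points from the text: an exclusive request that cannot be granted is re-queued instead of blocking, only exclusive locks are acknowledged, and while an exclusive lock is held only its owner is served.

```python
from collections import deque


class WindowLock:
    """Stand-in for the system lock used by the shared memory protocol."""

    def __init__(self):
        self.shared = set()    # current shared holders
        self.exclusive = None  # current exclusive holder, if any

    def try_shared(self, who):
        if self.exclusive is None:
            self.shared.add(who)
            return True
        return False

    def try_exclusive(self, who):
        if self.exclusive is None and not self.shared:
            self.exclusive = who
            return True
        return False

    def release(self, who):
        if self.exclusive == who:
            self.exclusive = None
        self.shared.discard(who)


class RequestAgent:
    def __init__(self, lock):
        self.lock = lock
        self.queue = deque()  # pending (origin, operation) requests
        self.acks = []        # acknowledgements sent for exclusive locks
        self.log = []         # RMA operations actually applied

    def submit(self, origin, op):
        self.queue.append((origin, op))

    def step(self):
        """Handle one request without ever blocking on the lock."""
        origin, op = self.queue.popleft()
        holder = self.lock.exclusive
        if holder is not None and origin != holder:
            # An exclusive lock is held: defer everyone except its owner.
            self.queue.append((origin, op))
        elif op == "lock_exclusive":
            if self.lock.try_exclusive(origin):
                self.acks.append(origin)         # explicit acknowledgement
            else:
                self.queue.append((origin, op))  # re-send to itself
        elif op == "lock_shared":
            if not self.lock.try_shared(origin):
                self.queue.append((origin, op))
        elif op == "unlock":
            self.lock.release(origin)
        else:
            self.log.append((origin, op))
```

In this model a local process takes the lock directly through WindowLock, while remote processes go through RequestAgent.submit, which mirrors the split between shared memory and TCP clients.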

4.4 Post/Start 




This synchronisation call also needs special attention, since the call requires 
a group as a parameter. This group is a subset of the group of processes that 
are using the same window. 

Any combination of processes is possible thus all the protocols must interact 
to synchronise. However this synchronisation can be partitioned so it is composed 
by synchronising the subgroups of processes which communicate using the same 
protocol. 

Figure 0 shows an example of the Start Post synchronisation with a shared 
memory protocol and a TCP protocol. There are two groups of processes using 
the shared memory protocol because they are running on the same machine.
However when they need to synchronise across to the other machine the TCP 
protocol is used. 

The MPI calls are unfolded into similar calls to the underlying protocol. The 
synchronisation group is decomposed into subgroups where all the members of 
the subgroup are accessed using the same underlying protocol. Each of these 
subgroups is then synchronised in turn using the appropriate protocol. The
MPI synchronisation call will only return when all the protocols synchronise the 
subgroups thus synchronising the main group. 
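
The decomposition can be sketched as follows. The function names, the protocol labels, and the callback table are hypothetical and stand in for the per-protocol synchronisation routines; the sketch only shows how a synchronisation group is split by the protocol used to reach each member and how the subgroups are then handled one after another.

```python
from collections import defaultdict


def split_by_protocol(group, protocol_of):
    """Partition the ranks in `group` by the protocol used to reach them."""
    subgroups = defaultdict(list)
    for rank in group:
        subgroups[protocol_of(rank)].append(rank)
    return dict(subgroups)


def synchronise(group, protocol_of, sync_functions):
    """Synchronise each protocol's subgroup in turn; return the order used."""
    done = []
    for protocol, ranks in split_by_protocol(group, protocol_of).items():
        sync_functions[protocol](ranks)  # protocol-specific start/post code
        done.append(protocol)
    # Only now, with every subgroup synchronised, may the MPI call return.
    return done


if __name__ == "__main__":
    # Seen from rank 0: ranks 0-3 share its machine, ranks 4-7 are remote.
    def protocol_of(rank):
        return "shm" if rank < 4 else "tcp"

    handlers = {
        "shm": lambda ranks: print("shared memory sync with", ranks),
        "tcp": lambda ranks: print("TCP sync with", ranks),
    }
    synchronise([1, 2, 5, 6], protocol_of, handlers)
```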

5 Performance 

The major issue in an integrating implementation such as this is performance, 
since it ought to add extra complexity with a low overhead. However the inte- 
gration is still in progress so final benchmarks could not be presented in this 
paper. This section presents some benchmarks done with a third party appli- 
cation, Pallas’ MPI benchmarks m- These were taken using a cluster of Sun 
Ultra5 workstations over a 100Mb Ethernet. All the benchmarks refer to groups 
of two processes issuing requests to each other. 

Since our generic implementation uses MPI’s point to point communication 
we choose to present a comparison between selected Pallas’ MPI1 and MPI2
benchmarks. The MPI1 benchmarks chosen are the ones which have message
patterns similar to our algorithms. 

The graph in figure EJshows the PingPong benchmark against unilateral single 
sided operations. On average all the RMAs are implemented with two messages 
exchanged between the origin and the target. The PingPong benchmark reports 
the time of a single message between two processes, i.e., half the time of a round
trip message. 

Figure 0 shows that performance of the single sided operations is not much 
worse. One can see that the Put times for larger messages are extremely high but 
this is due to thread switching and the fact that the Put operation is implemented 
using synchronous send. 

Our Request Agent is implemented using threads, which should not have 
such a visible effect on a workstation network using 100Mb Ethernet since the 
network latency would nullify any thread context switching overhead. However 
thread support was added when MPI-1 was extended to MPI-2 and thread con- 
currency at lower layers has shown an unexpected impact on performance. Thus 
the strange peaks in the graph of figure 01 This topic is still subject to develop- 
ment. 

Figure Elpresents PingPing versus bi-directional RMAs. One can see that the 
Get was also showing the signs we saw in the previous graph. 

Benchmarks taken with more processes have shown these effects at a larger 
scale since Pallas’ benchmarks on single sided operations are done between two 
processes while all the others wait in a barrier. The processes being benchmarked 
will receive the barrier’s internal messages at the lower levels, which suffer
from thread concurrency problems. These extra messages being exchanged have
a visible effect in performance and grow with the number of extra processes. 

6 Conclusions 

An implementation of MPI that restricts the usage of the API calls does not 
meet the standard in full. On the other hand if the implementation does not 
impose restrictions but has poor performance then the users will be reluctant 
to use it. However if one can present the user with an unrestricted library that 
performs better under certain system configurations then the user will be able 
to use MPI and take advantage of the system’s features whenever possible. This 
approach is the one we followed to implement a generic single sided library that 
will use the best protocol whenever possible and revert to a generic MPI-1 based
solution otherwise. 

Integration work is still ongoing, as shown in the performance section. The
main issues have been dealt with. The implementation is able to cope not only 
with the currently available protocols but also with future protocols. Thus the 
goals of this project have been achieved. 

Abstract. The second version of the MPI standard, MPI-2, introduced new
functionality to the Message Passing Interface. The ability to add new
processes to an MPI application at runtime was one of the main new extensions
to the existing functionality. This paper presents the first implementation of the
Process Creation and Management chapter of the standard for Win32 
environments. The implementation of this functionality in generic NT clusters 
presents challenging problems due to the distributed nature and the considerable 
difference between each cluster. A description of the problems faced while 
implementing this new functionality as well as the solutions implemented in the 
WMPI library are presented in this paper.



Introduction 

The first version of the Message Passing Interface (MPI) standard [1] was rapidly 
absorbed by the parallel computation community and became a de facto standard. 
Wishing to use MPI in a wider set of applications, MPI users asked the MPI
Forum [2] to extend the functionality of the standard. The second version of the
standard [3] was released in 1997. In fact, this second version extended the 
functionality of the standard beyond the message passing. One-sided communication, 
dynamic process creation and I/O were introduced in the standard. The new API 
allows the MPI users to create more complete and complex programs that are still 
portable. 

In this paper we present the implementation of the Process Creation and 
Management chapter of the MPI-2 standard in the WMPI library [4,5]. WMPI, which 
stands for Windows Message Passing Interface, was the first implementation of the 
standard available for Win32 environments. Originally based on the MPICH
implementation [6], in its last version it was completely redesigned to accommodate 
new features like simultaneous multiple communication devices and thread safety [7]. 
The new internal architecture was also designed in order to incorporate the necessary 
mechanisms to implement the dynamic creation of processes. This new functionality
is the continuation of an effort to fully implement the MPI-2 standard in the WMPI 
library where one-sided communication [8] and extended collective operations [9] 
were already implemented. 



MPI Process Creation and Management 



In the first version of the MPI standard, the MPI Forum created a standard that 
contained only an API for message passing. How the processes started, how they 
established communication and managed resources was not addressed. The MPI users 
felt that the static environment of MPI-1 was too restrictive. Former PVM [10] users
found it very difficult to port their existing programs to MPI and constantly ran into
problems with the lack of a process and resource management API. Furthermore, some
classes of applications (e.g. task farms, client/server applications and serial
applications with some parallel parts) can benefit from a process control API.

The second version of the MPI standard included a chapter for process creation and 
management. It was decided not to include an API for resource management, since no 
appropriate portable interface was found for a wide range of resource controllers. 
Although some functionality for process creation and management was introduced, 
MPI-2 does not manage the environment where it is running; rather, it provides an
interface to external process managers. 

The new functionality can be divided into two different parts: spawning new
processes and establishing communication between two different applications. There 
is also a possibility of disconnecting processes or two joined applications. A brief 
description of the MPI-2 process creation and management capabilities is presented 
next. 



Spawning New Processes 

MPI users have two different functions to create new processes at runtime. The
MPI_COMM_SPAWN function allows the creation of one or more processes that will
run the same executable with the same arguments. If the users wish to run different
executables or to pass different arguments to the several processes they should use the
MPI_COMM_SPAWN_MULTIPLE function.

When one of MPI’s spawn functions is executed, the new processes start their own
MPI environment. The two environments are immediately connected through an inter- 
communicator and the users can immediately exchange information between all 
processes. 

The spawning operation is collective over a certain intra-communicator (a subset 
of the original processes). Only the processes that belong to the intra-communicator 
will be included in the inter-communicator (the set formed by the intra-communicator 
plus the group of newly created processes) that connects them to the new processes. A 
root process, indicated in the function’s arguments, is responsible for actually creating 
the new processes. 

Establishing Communication between Separate Applications

Two applications can establish communication to exchange information and
cooperatively solve a problem. MPI-2 provides functions to establish
communication in a client/server model. One of the applications opens an MPI port
and waits for other applications to connect to it. The MPI port is an
implementation-dependent entry point that allows the two applications to exchange
information about the two environments in order to create the inter-communicator
between the applications. It is opaque to the user, who simply gets a port name that
uniquely identifies the port.
Using this port name the client application is able to connect to the server application 
and establish the communication. A name service is also provided, to help clients 
locate servers by name. 



Disconnecting Processes/Applications 

When two applications join, they can disconnect without having to terminate.
MPI users have to end every connection between the two sets of processes. A new
function to destroy communicators was inserted in the standard: 
MPI_COMM_DISCONNECT. This function waits for the end of all pending 
communications in the communicator and ends it. 

It is also possible to disconnect and cooperatively terminate processes that were 
created at runtime. Nevertheless there are some limitations in terminating processes. 
Since each MPI spawn function creates an MPI environment (i.e. an
MPI_COMM_WORLD communicator), a sub-set of the processes that were created with 
a single spawn function cannot be terminated separately from the others. 



MPI-2 Dynamic Environment 

At first glance it might seem that the MPI-2 environment is not really dynamic,
because it keeps the new processes connected through an umbilical cord (the
inter-communicator) instead of completely merging them. MPI users may find the
use of an inter-communicator instead of an intra-communicator odd. Although
the collective operations have been extended to embrace inter-communicators, they 
still have different syntax due to the existence of two separated groups. 

However, MPI users can create an intra-communicator that brings together all
processes in a single group by using the MPI_INTERCOMM_MERGE function. This
way it is possible to completely merge all the processes in a single communicator
(though not the MPI_COMM_WORLD). Once the processes can be joined in
intra-communicators, the MPI-2 environment is similar to MPI-1, but dynamic.
MPI-1 becomes a special case of the MPI-2 environment.
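
The effect of merging can be illustrated with a small model. This is not MPI code: processes are plain labels, the function merge_intercomm is hypothetical, and the model assumes that the order inside each group is preserved and that the flag children_high plays the role of the ordering argument of the merge call, placing the spawned group after the parents when set.

```python
def merge_intercomm(parents, children, children_high=True):
    """Model the merge of the two groups of an inter-communicator.

    Returns a mapping {process label: rank} for the resulting single group.
    """
    ordered = parents + children if children_high else children + parents
    return {proc: rank for rank, proc in enumerate(ordered)}


if __name__ == "__main__":
    parents = ["p0", "p1"]           # processes that called the spawn function
    children = ["c0", "c1", "c2"]    # newly spawned processes
    for proc, rank in merge_intercomm(parents, children).items():
        print(proc, "-> rank", rank)
```

After the merge every process has a single rank in one group, so ordinary intra-communicator collectives can be used across old and new processes.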

The ability to transform inter-communicators into intra-communicators
allows the users to easily port their static applications into an MPI-2 dynamic
environment. Moreover, it enables the users to make any combination of processes
from several different joined applications and spawned processes by using the MPI
group and communicator handling functions that are included in the MPI-1 version.
The dynamics of the MPI-2 environment are overshadowed only by the decision not to
change the concept of communicator introduced in the MPI-1 version. Once created,
a communicator cannot be changed. This implies that it is not possible to remove a
single process from a communicator while the others remain communicating. Moreover,
MPI_COMM_WORLD remains indestructible. Hence it is not possible to terminate a
subset of the processes created with a single spawn function or of an application. The 
termination of processes without ending the whole application is only possible when 
all the processes of their MPI_COMM_WORLD agree to end. 



Process Identification in a Dynamic Environment 

The implementation of a dynamic MPI environment imposes considerable changes to 
most of the existing libraries. Like many of the implementations available worldwide,
WMPI inherited from MPICH an architecture well adapted for a static environment,
but completely inadequate to cope with dynamic environments. It bases the
low-level identification of the processes on their rank in the MPI_COMM_WORLD
communicator. Since in MPI-1 every single process of the application belongs to the
same MPI_COMM_WORLD communicator, the ranks in this communicator uniquely
identify every process. However, in the MPI-2 dynamic environment there are several
MPI_COMM_WORLD communicators that can co-exist in the same MPI runtime
environment through the execution of MPI spawn functions and the joining of
different applications. In these cases the simple MPI_COMM_WORLD rank is not a
unique identifier.

It is thus necessary to create a form of identification that uniquely
represents a process. The system process number would be a good identifier in an MPP
(Massively Parallel Processor), however in a distributed environment each node has 
an independent operating system, hence two processes of a cluster can have the same 
process number. In a generic cluster environment the data exchange can be done by 
several different communication media, depending on the cluster configuration. An 
MPI implementation for such an environment has always to be aware of the 
processes’ addresses in the different communication media used. Since these 
addresses uniquely represent the process in the communication medium environment 
(hence the whole cluster), they are used in WMPI to represent the MPI processes 
inside the library. 

Each WMPI process contains information about every other process to which it is 
connected. The information is placed in a WMPI process object that is kept in 
memory while there is at least one common group between the two MPI processes. 
When a process joins a new inter-communicator it compares the addresses of the new 
processes with the addresses of the processes to which it already has a connection. If 
it finds a process object containing the same address in the same communication 
medium, it knows that it is the same process. In this situation it is not necessary to
create another process object; the process just increments the number of references to
the existing object.
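
The bookkeeping described above can be sketched as a small reference-counted table.
This is only an illustration under stated assumptions, not WMPI's actual code: the
names proc_obj, proc_acquire and proc_release, the integer medium identifier and the
string form of the address are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ADDR 64

/* One entry per remote process, keyed by (medium, address). Hypothetical layout. */
struct proc_obj {
    int medium;               /* assumed medium identifier, e.g. 0 = TCP/IP */
    char addr[MAX_ADDR];      /* address of the process in that medium */
    int refs;                 /* number of groups shared with that process */
    struct proc_obj *next;
};

/* Return the object for (medium, addr). It is created on first use; otherwise
 * its reference count is incremented. Returns NULL if allocation fails. */
struct proc_obj *proc_acquire(struct proc_obj **table, int medium, const char *addr)
{
    struct proc_obj *p;
    assert(table != NULL && addr != NULL && strlen(addr) < MAX_ADDR);
    for (p = *table; p != NULL; p = p->next) {
        if (p->medium == medium && strcmp(p->addr, addr) == 0) {
            p->refs++;        /* same address in the same medium: same process */
            return p;
        }
    }
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->medium = medium;
    strcpy(p->addr, addr);
    p->refs = 1;
    p->next = *table;
    *table = p;
    return p;
}

/* Drop one reference and free the object once no common group remains. */
void proc_release(struct proc_obj **table, struct proc_obj *obj)
{
    struct proc_obj **pp;
    assert(table != NULL && obj != NULL && obj->refs > 0);
    if (--obj->refs > 0)
        return;
    for (pp = table; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == obj) {
            *pp = obj->next;  /* unlink before freeing */
            free(obj);
            return;
        }
    }
}
```

A process joining a new inter-communicator would call proc_acquire once per remote
process and proc_release when the corresponding group is freed.
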
Spawning Processes in a Distributed System 

When an MPI library running on a generic cluster has to spawn new processes, it does
not know on which machine they should be created. In some systems, such as MPPs, a
queuing system capable of taking those decisions may be available; in many others it
is not. The MPI Forum was aware that some environments would require specific
information to enable that decision. Hence an extra argument that may contain
environment-specific information was introduced in the spawn functions. The argument
may contain several key-value pairs that the implementation should interpret. Not all
MPI implementations have to consider this argument when executing the functions. In
addition, different MPI implementations may require different information, hence
different keys. The usage of such an argument therefore reduces the portability of the
code. To minimize the portability problems of using this extra information, the
standard reserves a set of keys with a specific functionality. However, MPI
implementations are free to interpret these keys or not.

If no extra information is provided to the WMPI implementation when processes are
spawned, they are started on the same machine where the operation’s root process is
running. Nevertheless, WMPI can interpret two of the reserved keys to allow the user
to specify where the new processes should run:

• host: The user specifies the name of the machine where all the processes should
be started.

• file: The user specifies a process group (PG) file that contains all the information
about the processes that should be started in the spawning operation. The PG file
has the same structure as the one used to launch normal WMPI applications. This
file indicates the machine where each process should be executed. The file also
identifies the executable and the arguments of the processes.
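
The placement rule above can be illustrated with a short, library-independent sketch.
It does not call MPI; the type info_pair and the function resolve_placement are
hypothetical, and giving "file" precedence over "host" when both keys are present is an
assumption of this sketch rather than documented WMPI behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A key-value pair as it might be handed to a spawn call (hypothetical type). */
struct info_pair {
    const char *key;
    const char *value;
};

enum placement { PLACE_ROOT, PLACE_HOST, PLACE_FILE };

/* Decide where spawned processes go and store the host name or PG file name in
 * *target. Keys other than "host" and "file" are ignored. */
enum placement resolve_placement(const struct info_pair *info, size_t n,
                                 const char *root_host, const char **target)
{
    const char *host = NULL, *file = NULL;
    size_t i;
    assert(root_host != NULL && target != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(info[i].key, "host") == 0)
            host = info[i].value;
        else if (strcmp(info[i].key, "file") == 0)
            file = info[i].value;
    }
    if (file != NULL) {          /* assumed precedence: PG file first */
        *target = file;
        return PLACE_FILE;
    }
    if (host != NULL) {
        *target = host;
        return PLACE_HOST;
    }
    *target = root_host;         /* default: the machine of the root process */
    return PLACE_ROOT;
}
```

With no recognised key the function falls back to the root process's machine, which
mirrors the default described in the text.
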



Performance Results 



The spawn operation implies the execution of many heavy system calls, such as the
creation of processes. Moreover, the newly created processes have to synchronize and
exchange information to set up a new MPI environment and connect to the parent group
of processes. The results presented in this section show the elapsed time to spawn new
processes. To get the performance results we used two dual Pentium Pro machines
running the Windows NT operating system. The machines were interconnected using a
non-dedicated Fast Ethernet network. The time to create a process on any of the
machines is approximately 25 milliseconds.

Figure 1. Spawn in a single machine.

Figure 1 presents the time in seconds to create up to four processes on a local
machine as well as on a remote machine. Creating a remote process takes around
150 milliseconds longer because it is necessary to contact the remote machine’s
service and request the creation of the first process.

Using a third machine (Pentium II 233 MHz), processes were started on both
machines simultaneously. Figure 2 presents the time in seconds to create up to
four processes per machine (eight processes). In this case, the spawn time rises
because it is necessary to create processes on two different machines. In addition,
the created processes have to interact with each other to set up the environment and
connect to the creator process.




Connection Establishment 

There are many interconnection solutions for Windows clusters. It is impossible to 
predict, when designing the MPI library, all the possible topologies and 
interconnection media. 

To solve the problem of choosing the best communication medium for each pair of 
processes, WMPI requires the user to describe the whole cluster in a file, called 
cluster configuration file, which contains the information about all the machines that 
might be involved with the application. The file indicates in a simple way the 
interconnection technologies available, and which communication protocol should be 
used for communication between each pair of processes. 

An additional problem occurs when two applications join (MPI-2 allows two 
independently launched applications to join at any time), because the first message 
can come from anywhere in the cluster, and the participants in that communication 
must agree on the protocol to use. Since TCP/IP is present in practically all 
environments, WMPI uses it to establish the first connection and exchange the initial 
information, which includes information about the machines where the processes of 
both sides are running. Together with the cluster configuration file, each process can
then decide which communication medium to use to contact the other processes at the 
highest possible speed. 
WMPI Name Publishing Service 

The MPI-2 standard introduced a name publishing service, aimed at enabling MPI
applications to easily find the port name at which to contact an MPI server
application. This functionality has the added value of allowing some MPI applications
to work as location-transparent service applications. The name service must be
reachable by the server applications (to publish a new name) and by the client
applications (to look up the port name of a service). Once again, the implementation of
this name server depends heavily on the environment of the MPI implementation. In an
MPP, where MPI implementations are almost always proprietary, the name server might be
a process that is waiting on some well-known message queue. However, in a generic
distributed environment it can be running on almost any machine, and it is impossible
for the MPI library to know where the name service is without the user’s help.

In the case of WMPI the name server is a separate application that must be running on
some machine, whether it is one of the cluster’s machines or not. It is only reachable
through a TCP/IP connection. Since the TCP/IP protocol is very common in such
environments, this decision does not restrict usability. The MPI process that is
publishing a service name or looking up the port name of a service can get the name of
the machine where the name server is running, as well as the port on which it is
waiting, from an environment variable that must be set by the user. Alternatively, the
user can pass the name of the machine that should be contacted to the WMPI library
through the additional information parameter available in the calls to the name
service.



Related Work 

We know of two other implementations that provide this MPI-2 functionality. The
LAM/MPI implementation from the University of Notre Dame [11] implements all the
MPI-2 process creation and management functionality. It runs on Unix-based systems and
is freely available.

Fujitsu has a full MPI-2 implementation [12] that runs over its proprietary message
passing kernel, MPLib. This implementation uses the capabilities of MPLib, which
was specially developed for Fujitsu systems, to dynamically spawn new processes.

To the best of the authors’ knowledge, no other implementation for Win32 environments
that implements the MPI-2 process creation and management functionality is available
at this time.



Conclusions 

MPI rapidly became widely used by the high performance community. MPI users soon
started to request more functionality in the MPI standard. One of the requirements was
the ability to dynamically add processes to the application. The experience with
similar functionality in the PVM library helped the MPI Forum to create an interface
to dynamically spawn new processes. The introduced interface allows MPI users to work
with a truly dynamic environment, where processes can be created and destroyed at
runtime and applications can join and disconnect without having to finalize.

Although many MPI users required this functionality, its implementation demands
considerable changes to the existing MPI-1 implementations. The existence of many
MPI_COMM_WORLD communicators removed the possibility of identifying the application
processes through their MPI_COMM_WORLD rank, a widely used technique. Cluster-based
implementations have additional problems due to the distributed and generic nature of
the environment. To overcome these problems, the WMPI library underwent deep changes
in its internal design. Processes are identified through their communication medium
addresses. To allow newly connected processes to communicate through the fastest
communication path, a cluster configuration file was introduced. This made it possible
to produce a library that can run on any cluster and take full advantage of its
capabilities.

As the first implementation of the MPI-2 process creation and management chapter for
Win32 cluster environments, this evolution of the WMPI library also shows that it is
viable to provide that functionality.
Abstract. Ensemble has been proposed as a methodology for designing and 
implementing message passing applications by composition of modular and 
reusable message passing components. In this paper we adapt Ensemble as a 
mechanism for composing message passing applications in a meta-computing 
context on demand. Ensemble is particularly effective in the case where users 
demand different process topologies to be created out of the same components. 
We demonstrate this case by an application from transaction processing and in 
particular parallel query execution based on the tree pipelining model. 



1. Introduction 

We have developed the Ensemble methodology [1, 2, 3, 4] for designing and
implementing message passing applications based on the composition of modular and
reusable message passing (MP) components. In addition to the general benefits of
modular design, Ensemble aims to overcome three problems in the design and
implementation of MP applications, in particular those requiring irregular process
topologies.

The first is that implementation does not only depend on the application design, but 
also on the target MP Library (MPL), mainly because of the process management 
model each MPL adopts. For this reason, some topologies are easier to establish than 
others on specific MPLs. For example, it is easy to create tree topologies (regular or 
irregular) in PVM [5] and regular ring or grid topologies in MPI [8], but more 
difficult the other way round. Topologies not well suited to an MPL may certainly be 
created, but require specialized programming. 

The second problem is that MPLs support regular but not irregular or partially
regular topologies. Process topologies are programmed within processes by
specifying implicit communication channels, expressed by symmetric calls of send
and receive operations. For regular topologies the designer develops functions that
take a process identifier and return the identifiers of the processes it communicates
with. These functions are usually parameterized to return the identifiers of processes
for any size of the regular topology. For topologies that are not globally regular but
only partially or locally regular, or even altogether irregular, general functions
cannot be derived and consequently ad hoc programming methods are used.
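
For concreteness, the kind of parameterized neighbour function meant here can be
sketched for two regular topologies. The function names and the numbering schemes
(ranks 0..size-1 on a ring, heap-style numbering of a complete binary tree) are
assumptions chosen for illustration only.

```c
#include <assert.h>

/* Neighbours of process `id` in a ring of `size` processes numbered 0..size-1. */
void ring_neighbours(int id, int size, int *left, int *right)
{
    assert(size > 0 && id >= 0 && id < size);
    assert(left != 0 && right != 0);
    *left  = (id + size - 1) % size;
    *right = (id + 1) % size;
}

/* Parent of node `id` in a complete binary tree numbered from 0 in level order.
 * Returns -1 for the root. */
int tree_parent(int id)
{
    assert(id >= 0);
    return id == 0 ? -1 : (id - 1) / 2;
}
```

Such closed-form functions exist only because the topology is regular; for an
irregular graph there is no comparable formula, which is the difficulty the paragraph
above points out.
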

The third problem is that reuse of message passing components is limited. The
designer of a message passing program has to express in a single source program
(component) the appropriate interactions of all the processes that will be spawned
from it, considering all their possible positions in the topology and, additionally,
any size of the topology. Reusability is limited by the use of specific process
identifiers (tid or rank) in communication channels, or by the functions determining
them, which assume specific regular topologies.

The heart of the Ensemble methodology is the design of modular reusable message 
passing components, which may be used in any topology, whether regular, partially 
regular or irregular. We have defined simple generic process communication 
interfaces independent of any topology, which processes use in the parameters of 
point-to-point and group operations. Topologies are composed by a loader program, 
which binds process communication interfaces at run time, directed by composition 
scripts. The components, the scripts, the loader and utility libraries comprise the 
Ensemble Software Architecture. 

In this paper we adapt the use of Ensemble Software Architecture (ESA) as an 
infrastructure for composing message passing applications in a meta-computing 
context on demand. ESA is particularly effective in the case where different process 
topologies need to be created from the same components according to user demand. 
We demonstrate this case by an application from transaction processing and in 
particular parallel query execution based on the tree pipelining model [10,12]. 

The structure of the paper is as follows: In section 2 we present the Ensemble 
Software Architecture, in section 3 we adapt and demonstrate the use of ESA in a 
meta-computing context; finally in section 4 we present our conclusions. 



2. The Ensemble Software Architecture 

Ensemble specifies a software architecture (figure 1) common for all MP 
applications in any MPL. The design of a message passing application is maintained 
in the implementation, which is an “ensemble” of reusable executable program 
components and of composition directives (scripts). 

The script specifies the application processes (to be spawned from the reusable 
executable program components) and their topology (or Process Communication 
Graph-PCG) independent of any execution environment (MPL or architecture). The 
script also specifies the allocation of resources of the execution environment 
(mapping of processes to processors, input and output files, etc.). 

The source programs are designed to be independent of any MPL. By compiling 
and linking with the appropriate MPL and Ensemble libraries the reusable executables 
(within the specific architecture and MPL) are obtained. 

A loader program, universal within an MPL, interprets the script and establishes 
the topology by creating processes and by setting communication channels. Instead of 
functions associating processes to their position in a topology, the topology (regular 
or irregular) is composed directly by binding communication ports of the spawned 
processes. The loader performs all process and resource management, as specified in 
the script. 
Fig. 1. The Ensemble Software Architecture 



2.1 The reusable components 

The Ensemble components do not involve any process management, nor do they assume any
specific topology in which they operate. Instead they support an unbound interface of
communication ports. Ports are implemented as structures holding information about the
receiver or the sender process, used respectively by send and receive operations. In
PVM 3.3, for example, a port is defined as a structure
struct port_struct { int tid, msgtag; }; which holds the pair (tid, msgtag), the
parameters denoting the message destination in send operations and the message origin
in receive operations. Ports of the same type form arrays, whose length is in general
variable and has to be dynamically allocated. Each array is described by a structure
struct port_type { int portcount; struct port_struct *port; }; which is a header
element with two fields: one holding the actual number of ports (portcount) and a
pointer to the actual array of ports of the process to which it belongs. Note that
this declaration is independent of any specific MPL; the differences are hidden in the
declaration of a single port. The interface is an array of port_types, declared as
struct port_type *Interface;

In the body of the components all send and receive operations in processes refer to
ports, identified by the appropriate port type and the port index within the type. In
PVM, for example, the tid and msgtag parameters of the pvm_send and pvm_recv routines
should refer to the tid and msgtag of a port P of type T, i.e. Interface[T].port[P].tid
and Interface[T].port[P].msgtag, respectively. Alternatively, the programmer has the
option to use abstract communication functions, which hide the pvm_send and pvm_recv
calls within wrapper routines such as send(T, P, What) and receive(T, P, What), to
indicate that message What is sent to, or respectively received from, port P of type
T. The same wrapper routines may call MPI routines if the target MPL is MPI, thus
hiding all differences and maintaining one source code of components for any MPL.
Ports are unbound at compile time, as they do not have any values. In this way process
interfaces are open and scalable, and the code that uses them is reusable.
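
A wrapper of this kind might look as follows. This is a minimal sketch, not the
Ensemble library itself: the two structures follow the PVM flavour described above,
while port_send, transport_fn and the recording stub are hypothetical names introduced
so that the example stays independent of PVM and MPI.

```c
#include <assert.h>
#include <stddef.h>

struct port_struct { int tid, msgtag; };                    /* one PVM-style port */
struct port_type   { int portcount; struct port_struct *port; };

/* Stand-in for the MPL send call (for example a PVM or MPI send). Hypothetical:
 * a real wrapper would call the library routine directly. */
typedef int (*transport_fn)(int tid, int msgtag, const void *buf, size_t len);

/* Send `len` bytes to port P of type T. The destination comes from the interface,
 * so the component never names a process identifier itself. */
int port_send(const struct port_type *iface, int T, int P,
              const void *buf, size_t len, transport_fn transport)
{
    const struct port_struct *p;
    assert(iface != NULL && transport != NULL && T >= 0);
    assert(P >= 0 && P < iface[T].portcount);
    p = &iface[T].port[P];
    return transport(p->tid, p->msgtag, buf, len);
}

/* Stub transport for demonstration: records where the message would go. */
int last_tid = -1, last_msgtag = -1;

int record_transport(int tid, int msgtag, const void *buf, size_t len)
{
    (void)buf;
    last_tid = tid;
    last_msgtag = msgtag;
    return (int)len;
}
```

Because the component only mentions T and P, the same code can be bound to any
neighbour by filling in the port array at run time.
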

A common main for all program components, shown below, has been developed. The
application computations of each component are coded in the RealMain() function,
which accepts the same parameters as the function main() in regular C programs, as
well as the Interface structure.

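int main(int argc, char **argv)
{
    MakePorts(Interface);            /* allocate the ports requested by the loader */
    Setinterface(Interface);         /* bind port values, coordinating with the loader */
    RealMain(Interface, argc, argv); /* application computations */
    return 0;
}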

Interface contains all the necessary interfacing information: the actual number of
ports for each port type, which the loader program passes as command line parameters,
and the values of the port fields. The first call is MakePorts(Interface), which reads
these parameters, allocates space for the appropriate number of ports, and sets the
portcount field to the appropriate value for each port type in the array Interface.
Processes set the actual values of their interface by executing the routine
Setinterface(Interface), which must coordinate with the loader. Each MPL requires its
own implementation of the Setinterface routine. In MPI, as process identification
(rank) is known at compile time, it is possible to pass it to the processes in their
command line parameters. In PVM, where process identifiers (tid) are determined
dynamically, it is in general not possible to pass to a process at the time of its
spawning the tids of its neighbours, as these may not have been spawned yet. Instead,
the PVM loader program sends messages with the tids of the neighbouring processes and,
symmetrically, Setinterface receives these messages and updates the interface.

Finally, RealMain() is called, which performs the actual application computations. 



2.2 The Ensemble script 

The script is structured in two main parts. The first part abstractly specifies the 
Process Communication Graph (PCG) of the application, independently of any MPL 
or underlying architecture. Here we specify abstractions of the components involved; 
the processes to be spawned from each component together with their interface 
parameters; and the communication channels between ports. The second part of the 
script specifies the resource allocation of the execution environment, process 
parameters required in the application and information required by specific MPLs and 
the underlying architectures. 
2.3 Composition of applications 

The Loader program interprets the script, composes the message passing application 
by spawning processes from component executables and establishing their 
communication channels, relieving the programmer of a complex task. There is a 
universal loader program for all applications in a given MPL. Ensemble tools for 
PVM [1,3], Parix [2] and MPI [4] have been developed. 



3. Ensemble Software Architecture in Meta-computing 



ESA may be used in a meta-computing context to compose applications on demand 
(figure 2). We assume that the service provider has developed the reusable program 
components. 
Fig. 2. Using the Ensemble Architecture for composing Applications on Demand 

An application server accepts and manages user requests for program executions, and
from each request it produces the PCG part of the script. The application server may
consult local information (managed by the service provider) about the availability,
policies and permissions of resource allocation, and completes the application script,
which is then passed to the Ensemble environment. The application may now be composed
and executed as explained in section 2.

In the case of regular SPMD applications, topology configuration requires simpler
parameters, and the use of Ensemble scripts is obviously excessive and does not
provide any real benefit. But for the class of applications that need to be
configured on demand, it may prove valuable.

In the next section we demonstrate the above general ideas on a specific
application in the domain of transaction processing, in particular parallel SQL
query execution.



3.1. Parallel Query Execution on Demand 

We first describe the basic model for query execution and then present the 
implementation architecture. 



3.1.1 The Parallel Query Execution Model: Tree Pipelining 

In the Tree Pipelining query execution Model (TPM) [10, 11, 12] an SQL query with a
large number of joins, restrictions, projections, set operations and aggregates is
transformed into a query tree representing its parallel query execution plan (QEP).
The query tree is built by a preoptimizer, which performs “unnesting” of nested
queries [7], transforms disjunctions into unions and places selection nodes below join
nodes on the same relations. Set operations are placed below the root projection and
separated from the nodes containing predicates by further projections. So projections,
selections and joins form a “PSJ-zone”, in which joins are gathered in a “JOIN-zone”.
Similarly, set operators are gathered in a zone directly below the root. A
parser/preoptimizer producing this tree structure is presented in [11].

This query tree may be used as the Process Communication Graph in Ensemble to 
compose a message passing application. The nodes of the tree represent relational and 
set algebra operators and the arcs represent communication of results from children to 
parent nodes. The query tree is bushy [6], so that nodes in different subtrees can be 
executed in parallel, while adjacent nodes can be executed in pipeline. The intrinsic 
parallelism of the tree representation is thus fully utilized: (i) processes on all
leaf-processors may start execution immediately, as they have the base relations available,
producing tuples of intermediate relations and propagating them to their parent nodes 
and (ii) processes on inner processor-nodes start execution in pipeline mode as soon 
as they have sufficient tuples to operate on. 

The initial query tree may also be optimized in a parallel way [12]. The optimizer 
minimizes the execution time of the query by minimizing I/O and communication 
between processes. It does not take into account processor utilization. A cost model 
for the estimation of communication and I/O costs in parallel execution spaces 
according to TPM has been used in exhaustive parallel optimization and in 
parallelized enhanced iterative improvement. 



3.1.2 Parallel Execution of Queries 

In such a parallel execution scheme, SQL queries demand the creation and execution of
distinct “applications”. However, these “applications” are all composed out of the
same basic components, namely implementations of the basic relational algebra
primitives (select, project and join) and of the set operators. We have developed a
suite of these primitives as Ensemble reusable components, which may be used to
compose any required parallel program as directed by the (optimized) query tree. For
the execution of joins we considered the nested loops algorithm and the merge join for
sorted input. For equijoins, antijoins and equality outerjoins we also considered a
hash algorithm. For restrictions we used two algorithms, for sorted and unsorted
restriction attributes. For projections we used a merge sort method. The execution
algorithms do not support indexing.

In figure 3, we depict the parallel execution of an SQL query on demand. The user
sends an SQL query. The Query Server reads the query as well as appropriate
information concerning the System Services, the DB schema and the File System, and 
constructs the initial query tree, the Process Communication Graph. If the initial 
query tree is small it may be given directly for composition and execution; otherwise 
it may be optimised before doing so. 

Then, the loader interprets the (optimised) query tree and using the reusable 
executables from the component pool, creates the processes and establishes the 
appropriate communication channels in the execution environment. 




Fig. 3. Parallel Execution of SQL Queries on demand 

The Query Server as presented above may be relieved of certain duties. A user
agent may perform the SQL to query tree transformation locally and send the initial
query tree. For the optimization phase several options are available. The local agent
may also optimize the query tree and send an optimized query tree. In this case the
Query Server is only concerned with the allocation of resources according to resource
availability and policies. Alternatively, the Query Server may initiate the
optimization of the initial query tree as a parallel application and may also be
concerned with processor utilization and load balancing issues. We leave these issues
open, as in the context of this paper we are mainly concerned with the underlying
mechanisms of the on-demand composition of message passing applications.



4. Conclusion 

In this paper we adapt the Ensemble Software Architecture (ESA) as an infrastructure 
for composing message passing applications in a meta-computing context on demand. 
ESA is particularly effective in the case where different process topologies need to be 
created from the same components according to user demand. 

We demonstrated the issues involved in parallel query execution, where all queries 
(“applications”) are composed out of the same relational algebra and set operators and 
users do not need to know anything about their execution; as far as they are concerned 
they submit queries and receive the results. All execution aspects are transparent. 
Abstract. In this paper, we present Stampi, a communication library which extends an
MPI application on a single parallel machine to a cluster of parallel machines.
Stampi provides functionality required for constructing distributed applications and
environments based on the MPI2 standard, with a focus on dynamic process management.
Since the communication bridging mechanism is transparent to users, it is very useful
for assembling and linking MPI applications on meta-computer systems. Furthermore,
Stampi supports two novel functions: communication between a Java applet and the
backend parallel computer, and remote file I/O. Both give us a framework for
distributed resource management based on an MPI communication infrastructure. This
paper covers the architecture of Stampi.



1 Introduction 

Recent progress in computer and network technology allows high-speed, large-scale
scientific simulation. By using several parallel computers, we can treat huge problems
that were heretofore impossible. Such computing, called metacomputing, is greatly
desired in the realm of computational science and engineering. To construct an
environment in which users handle such computing with ease, it is important to support
seamless use for all users (here ‘seamless’ has various meanings). One can easily
imagine that communication is one of the core parts of such an environment, and that it
should be highly developed and flexible to support various kinds of services.

Users need many existing applications to run in such a seamless environment and to
carry their know-how over to distributed computing. From the standpoint of scientific
computing, both speed-up and scale-up are indispensable for the developed code to
profit from distributed computing. In addition, the cost of running or porting is an
important factor for scientists who are not experts in this area. Fortunately, almost
all existing applications are developed with portability across various platforms in
mind, and are ported with MPI. This means that if a common communication layer and
process management system work over distributed machines, MPI applications also run on
a virtual parallel computer system in which each machine becomes a computational
element. Thus we recognized that utilizing the MPI standard is the best way to do
distributed parallel computing, and adopted it as the communication infrastructure of
our metacomputing environment. This was the first objective in developing Stampi.
Much effort has gone into implementations of MPI that extend its functionality to
heterogeneous computing. PACX-MPI, MPICH-G, MPI_Connect and LAM (version 6.4) are
well known, and many distributed computing experiments have been reported. In
particular, PACX-MPI was developed as part of a global wide-area application testbed,
and it demonstrated transatlantic computing at SC97, SC98 and SC99. MPICH-G is a
communication infrastructure in Gridware [2], which integrates MPICH with the Globus
toolkit and Nexus. Both are intended for global or wide-area computing spread across
several countries, and support static process management and network routing. Stampi
was originally developed on a local-area network in which several small parallel
computers work. We are going to expand it to wide-area networks. Stampi supports
features ranging from dynamic process management to MPI-IO [9] for effective use of
limited computational resources in a LAN.



2 Outline of Stampi’s Features 

The basic concept of Stampi is to remove the barrier of heterogeneity in communication
in distributed computing, and to provide all communication functionality with MPI
semantics. We focused on utilizing applications developed for the MPI2 standard with
minimal modification. Stampi realizes distributed computing in which a virtual
supercomputer comprises any parallel computers connected via LAN or WAN. The minimum
configuration of Stampi is illustrated in Fig. 1. The Stampi library is linked to the
user’s parallel application, the vendor-supplied MPI library, and message routers
bridging the gap between two user applications on different machines. Here we assume
that parallel machines A and B are managed in interactive and batch (NQS) mode,
respectively. In addition to the basic configuration shown in Fig. 1, it is assumed
that NQS commands (qsub, qstat, etc.) are available only on the frontend node of
machine B. All nodes in a cabinet, including the frontend, are IP-reachable on private
addresses and separated from global IPs.

To establish real distributed computing, we should introduce common rules for sharing
the distributed heterogeneous resources. These involve removing or hiding differences
in software versions, or in the handling of data and applications, caused by
heterogeneity. In the above-mentioned configuration, we considered the communication
layer, segmentation into private addresses, execution mode (interactive and batch),
handling of remote processes, compiling and linking (commands, options, etc.), data
formats and so on.

To achieve distributed execution without modifying MPI applications, Stampi uses the
profiling interface shift in its implementation. Users therefore only need to
recompile their codes before running the application. Stampi manages hosts and local
ranks internally through a Stampi-communicator, and when it detects a communication
it chooses the most suitable inter- or intra-communication mechanism. In the current
implementation, Stampi uses a TCP socket for inter-communication and the
vendor-supplied communication mechanism for intra-communication. Introducing such a
hierarchical mechanism lessens the disadvantages of using one common communication
mechanism for everything.
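
The selection step described above can be pictured with a small sketch. It is not
Stampi's code: the names StampiCommunicator, ProcessEntry and choose_transport are
hypothetical, and the only behaviour taken from the text is that a host and a local
rank are recorded per process, and that traffic between machines goes over TCP while
traffic inside one machine uses the vendor mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEntry:
    host: str        # machine the process runs on
    local_rank: int  # rank inside that machine's vendor MPI


class StampiCommunicator:
    """Hypothetical table mapping a global rank to (host, local rank)."""

    def __init__(self, entries):
        self.entries = list(entries)

    def choose_transport(self, src, dst):
        # Same machine: stay on the vendor-supplied mechanism.
        if self.entries[src].host == self.entries[dst].host:
            return "vendor"
        # Different machines: use the TCP socket path.
        return "tcp"


comm = StampiCommunicator([
    ProcessEntry("A", 0), ProcessEntry("A", 1), ProcessEntry("B", 0),
])
print(comm.choose_transport(0, 1))  # vendor
print(comm.choose_transport(0, 2))  # tcp
```

Keeping the decision in one lookup is what lets the application code stay unchanged:
the caller only names a global rank.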

A common data representation is also very important in a heterogeneous environment.
Stampi adopted the external32 format defined in MPI-2. Stampi hooks a conversion
procedure into the subroutines MPI_Pack and MPI_Unpack, and also into the external
communication layer.
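
To illustrate what a common representation buys, the following sketch converts
integers and doubles to a big-endian, fixed-size wire form, which is how MPI-2
describes external32 (4-byte integers, 8-byte IEEE 754 doubles). The helper names are
hypothetical and this is not Stampi's conversion routine; it only shows the idea of
converting at the boundary.

```python
import struct


def pack_external32_int(values):
    # 4-byte signed integers, big-endian, regardless of the host byte order.
    return struct.pack(">%di" % len(values), *values)


def unpack_external32_int(data):
    count = len(data) // 4
    return list(struct.unpack(">%di" % count, data))


def pack_external32_double(values):
    # 8-byte IEEE 754 doubles, big-endian.
    return struct.pack(">%dd" % len(values), *values)


wire = pack_external32_int([1, 256])
print(wire.hex())                   # 0000000100000100
print(unpack_external32_int(wire))  # [1, 256]
```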

Among the other items, the handling of remote processes is especially significant; it
was realized by introducing a router process and a new API, which are described in
the next section. Although optimization of routing and control of load balance are
indispensable for distributed computing, they belong to a layer above Stampi and we
do not explore them in this paper (please see the related work).

3 Technical View of the Stampi Library 

3.1 Router Process 

The message router processes route messages from internal private IP addresses to
global IP addresses (and the other way round). Although VPN and NAT are effective
ways of dealing with private networks, introducing them on a parallel supercomputer
raises management and support problems. For this reason, we introduced the router
process in Stampi.

The router process runs on the frontend node, and makes the connection between the
internal nodes and the counterpart. Since on some parallel computers MPI applications
cannot run on the frontend, it was developed as a non-MPI application. Another
function of the router is starting up a remote MPI application. Since a start-up
request for the remote application cannot reach the remote machine directly, the
router issues a remote procedure call.

To limit the overhead introduced by router processes, users can change the number of
routers statically in the current version of Stampi. When all nodes of a parallel
computer have global IP addresses, the router process is not required and all MPI
processes can communicate directly. If one parallel machine has global IPs and
another has private IPs, one router is created on the frontend node of the system
with private IPs. Moreover, if the parallel computers have multiple network interface
cards (NICs), the user can create several routers on the nodes that have NICs, and
these take care of parallel network routing (Fig. 2).

Fig. 2. Examples of router configuration (one router and multi-network routing)

Next, we present the performance of the router. Since redundant memory copies must
occur inside the router process, throughput can be no more than the harmonic average
of that of the connected networks. In preliminary measurements, sustained performance
did not exceed the estimation (see Table 1). Here, a Fujitsu VPP300 requires a router
process, whereas an SGI ONYX and an NEC SX-4 need no routers, and all machines were
connected via HIPPI (peak 100 MB/s). For long messages, Stampi achieved about 94%
(VPP inside-VPP frontend-ONYX, relative to the estimation) and 95% (SX-ONYX, relative
to the raw TCP socket), so the loss attributable to introducing the router was about
1% of the communication performance. These results show that Stampi makes full use of
the performance of the TCP socket and is an effective tool for large scientific
applications. For short messages, by contrast, it reached only around half of the
estimation derived from the sustained throughput. This can be interpreted as follows:
communication overheads such as header checking, protocol handling and data
conversion became significant in short communications, but they amounted to only a
few milliseconds per Stampi packet and were negligible in long communications.
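
The "estimation" row of Table 1 can be reproduced from the two raw TCP rows if the
per-byte transfer times of the two hops are simply added, i.e. 1/T = 1/T1 + 1/T2
(half of the harmonic mean), and if the latencies are added as well. This reading is
inferred from the published numbers rather than stated in the text, and the function
names below are hypothetical.

```python
def series_latency(*latencies_ms):
    # Latencies of consecutive hops add up.
    return sum(latencies_ms)


def series_throughput(*throughputs):
    # Store-and-forward through a router: per-byte times add, so 1/T = sum(1/T_i).
    return 1.0 / sum(1.0 / t for t in throughputs)


# Raw TCP values from Table 1 (VPP frontend-ONYX and VPP frontend-VPP inside),
# longest message.
estimate = series_throughput(14.7, 40.9)
print(round(estimate, 1))                    # 10.8
print(round(series_latency(1.71, 1.00), 2))  # 2.71
print(round(10.2 / estimate, 2))             # 0.94, measured Stampi vs. estimate
```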



Table 1. The performance results in a ping-pong communication

                                                    latency   throughput [MB/s] for message length [Byte]
                                                    [ms]      8000     8*10^4   8*10^5   8*10^6
VPP frontend-ONYX (raw TCP)                         1.71      3.49     13.3     14.6     14.7
VPP frontend-VPP inside (raw TCP)                   1.00      6.05     26.7     39.2     40.9
VPP inside-VPP frontend-ONYX (estimation)           2.71      2.21     8.87     10.6     10.8
VPP inside-VPP frontend-ONYX (Stampi; one router)   5.89      1.45     5.67     9.34     10.2
SX-ONYX (raw TCP)                                   0.92      6.75     20.3     20.7     21.2
SX-ONYX (Stampi; no routers)                        2.60      3.54     14.3     19.0     20.2



3.2 Dynamic Process Management 

Dynamic process management is one of the most important functions of Stampi. This
function and its APIs were adopted in MPI-2, and the user can write manager-worker
style process creation by calling the MPI_Comm_spawn or MPI_Comm_spawn_multiple
functions. In the semantics of the MPI-2 standard, these functions create new child
MPI processes on the specified host and establish a connection between parent and
children. A new inter-communicator is obtained, which enables inter-communication.
This functionality thus provides the ability to rebuild the virtual computer world at
run time according to the user’s intention.

Fig. 3. Spawn operation under the batch mode



As described in the previous subsection, this function requires a message router to
create a remote process. The router relays start-up commands and information such as
the connected network, allowed socket number, program name, user ID, number of
processors to be used, and working directory. These specifications are given by
setting an MPI_Info object, which holds a list of (key, value) pairs. Stampi defines
some info keys of its own: user ID, node, partition, batch queue name and so on.
Furthermore, the router launches the remote program with the help of a remote shell
command.

Stampi supports both interactive and batch modes. Although batch requests from other
users may interfere with establishing the connection between batch and interactive
jobs, batch mode was introduced so that both kinds of jobs can be used together in
heterogeneous environments where only batch jobs are permitted. Fig. 3 shows the
sequence of remote process creation in batch mode. When user code calls the
MPI_Comm_spawn function, the router process starts a Stampi start-up command
(starter) and generates a script file, which is submitted to a batch queue. Next, the
starter written in the script file launches the user MPI application and the router
process. Finally, the connection between machines A and B is established and an ACK
is returned. The interactive mode is simpler than the batch mode: the initiation
command starts the remote application and forks the router directly. All other
operations are the same as in batch mode.

A client-server type of connection is also supported through MPI_Open_port,
MPI_Comm_connect and MPI_Comm_accept. This type of connection supports the
construction of client-server applications. In addition, it makes it possible to
establish complete connectivity among more than three applications, because the spawn
function only provides a connection between parent and children, and there is no way
to connect processes of the same generation, or a child and its grandparent. Since
the mechanism is very similar to the TCP socket model, its detailed implementation is
omitted.

Dynamic process creation requires users to modify their codes, or to insert these
functions into them, in order to control a remote MPI application. Stampi also has a
command-line option whereby Stampi spawns the distributed applications and builds a
single MPI_COMM_WORLD when the MPI_Init calls complete. For example, the command

‘jmpirun -np 1 foo : -np 4 -host B -nqsq classC foo’

starts one process (foo) on the local host and 4 processes (foo) on machine B with an
NQS submit command, and creates a unified MPI_COMM_WORLD. From the application user’s
viewpoint, this option provides a flat MPI world spread over parallel machines, and
it clears the way for increasing the number of processors. We would like to point
out, however, that it only removes the communication barrier between distributed
computer resources. It incurs a computing penalty because of load imbalance and
varying inter-communication speeds. Users who expect better performance should
consider investing some effort in tuning their code for the distributed environment.



3.3 MPI I/O 

MPI-I/O is another significant feature of the present Stampi. Stampi provides
distributed I/O similar to ROMIO, and this function removes the need for manual file
transfers and temporary disk space when handling large amounts of data. The user
first calls the function MPI_File_open, and Stampi then creates an I/O server on the
specified machine. The basic architecture of MPI-I/O is shown in Fig. 4. The I/O
server acts as an MPI application and communicates with the client processes about
I/O operations. A typical I/O operation corresponding to an MPI_File_read proceeds as
follows. (In the MPI-2 standard, this function performs blocking data access with
individual file pointers.)

1. send a read request, ‘(read command, data type, count, file position, ...)’,

2. prepare a buffer area on the I/O server,

3. read data from remote disks,

4. return a status code for the data read by the I/O server, and finally

5. if the status is ‘SUCCESS’, transfer the data to the client.
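
The five steps above can be sketched with an in-memory stand-in for the I/O server.
The class and function names are hypothetical, the "remote disk" is simulated with a
byte buffer, and treating a short read as an error is an assumption made for the
sketch, not something the text specifies.

```python
import io


class IOServer:
    """Hypothetical stand-in for the I/O server, working on an in-memory file."""

    def __init__(self, storage):
        self.storage = storage  # file-like object on the server side
        self.buffer = None

    def handle_read(self, request):
        # Step 1 has happened: the request tuple arrived from the client.
        command, item_size, count, position = request
        if command != "read":
            return "ERROR"
        # Step 2: reserve a buffer large enough for the requested data.
        self.buffer = bytearray(item_size * count)
        # Step 3: read from the (here simulated) remote disk.
        self.storage.seek(position)
        nread = self.storage.readinto(self.buffer)
        # Step 4: report a status code for the read.
        return "SUCCESS" if nread == len(self.buffer) else "ERROR"

    def fetch(self):
        # Step 5: ship the buffered data to the client.
        return bytes(self.buffer)


def client_read(server, item_size, count, position):
    status = server.handle_read(("read", item_size, count, position))
    return server.fetch() if status == "SUCCESS" else None


server = IOServer(io.BytesIO(b"0123456789abcdef"))
print(client_read(server, 4, 2, 4))  # b'456789ab'
print(client_read(server, 4, 4, 8))  # None, only 8 bytes remain
```

Separating the status reply from the data transfer is what makes the exchange a
multi-step handshake: the client learns whether the buffer was filled before any bulk
data is sent.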

In this read operation, the client first sends file manipulation requests, which are
then processed by the server on its local disk. As shown in Fig. 4, the
implementation of MPI-IO is based on a three-way connection process. One might feel
that this is a costly design, but data transfers of megabyte order hide these
overheads, and securing the buffer area is important in real I/O operations. Other
operations, such as MPI_File_sync, collective access (MPI_File_read_all and
MPI_File_write_all) and the nonblocking versions, are processed in a similar way. As
described in the previous subsection, the user application can talk with the I/O
server directly if direct communication is available.

    call MPI_Info_create(info, err)
    call MPI_Info_set(info, "host", "data_server", err)
    call MPI_File_open(MPI_COMM_WORLD, &
         'data.file', MPI_MODE_RDONLY, info, fh, err)
    call MPI_File_read(fh, buf, n, MPI_INTEGER, st, err)
    call MPI_Get_count(st, MPI_INTEGER, cnt, err)

Fig. 4. The architecture of MPI-IO in Stampi (descriptions in this view present a read
operation)

3.4 Stampi/Java 

Java is a portable programming language that is available on a wide range of
platforms. Many users hope to access metacomputing resources from their local
terminal through a user-friendly interface such as a Web browser. Stampi/Java
provides functions for connecting MPI applications (developed in Fortran or C) and
Java applets. This extension is not part of the MPI standard, but similar
implementations have been proposed. The Stampi/Java implementation supports
point-to-point communication, process creation and client-server connection to the
backend machines. Message relaying between the Java socket layer and Stampi
communication is realized by introducing a Stampi/Java server (SJ), which runs on the
web server. SJ talks with both the applets and the Stampi application, and relays the
received messages. The architecture is presented in Fig. 5. In the Stampi class
layer, message objects are marshaled and translated into the intermediate format.

A preliminary test was carried out using the following configuration: the Web server,
the web client and the backend supercomputer were a Sun Enterprise, a notebook PC and
an SGI ONYX2, respectively, all connected with 100Base-TX. Latency was less than 1
millisecond and throughput reached 530 KB/s. The result implies that Java applets
cannot perform high-speed communication with the supercomputers. They can, however,
act as a program controller with which users determine the dynamic assignment of
resources in a metacomputing environment.







Fig. 5. The architecture of Stampi/Java 



4 Summary 

We have presented an outline and the architecture of Stampi, which provides some
extensions to vendor MPI libraries and to the world of distributed parallel
computing. The implementation of dynamic process control, MPI-IO and the bridge to
the Java platform extends its usability, and we believe that it will contribute to
the progress of metacomputing.

Stampi has currently been ported to several platforms, for example MPPs (Hitachi
SR2201, IBM SP, Intel Paragon, Fujitsu AP3000, etc.), vector parallel computers
(Fujitsu VPP300, NEC SX-4 and Cray T90), SMP servers (SGI Origin, etc.), WS/PC
clusters (Solaris, HP, Alpha, Linux and FreeBSD) and so on. We installed Stampi on a
parallel computer cluster, COMPACS (COMplex PArallel Computer System), introduced at
the Japan Atomic Energy Research Institute, and use it as a testbed for
metacomputing. Several results in distributed parallel computing, such as a
fluid/structure coupled simulation for an airplane [11] and a hybrid plasma
simulation [12], have been reported, and the number of supported functions and
platforms is increasing. The latest progress and distributions of Stampi can be
obtained from the following URL: http://ssp.koma.jaeri.go.jp/en/stampi.html.

Finally, the authors would like to thank the reviewers for their valuable comments,
and Toshiya Kimura, Hironori Kasahara and Michael M. Resch for their helpful
discussions.

Abstract. Metacomputing applications can often be composed from
sub-applications written for parallel, but not wide-area distributed systems.
For these systems tools like MPI or PVM are well known and
many legacy applications exist. This paper describes the usage of such
sub-applications as components for a metacomputing application. The
approach is based on two ideas: First, abstract data objects encapsulate
the binary executables and potentially the source code. Second, the
compilation and execution of MPI components is provided as an abstract
service type. This approach is implemented as a prototype in the
metacomputing infrastructure Amica using Java and CORBA.



1 Introduction 

Metacomputing adds new challenges for application development regarding
heterogeneity, security, reliability, and many more aspects. Therefore, several
infrastructures exist which support the development of metacomputing
applications. Well-known examples are Globus [2] and Legion.

These complex infrastructures require the application developer to learn new
tools and new programming models. However, many developers of parallel
applications are not willing to invest the time to learn another complex
proprietary system. They want to use their acquired knowledge, which usually
includes an imperative language like Fortran or C++, and a middleware based on
the message passing paradigm, like MPI or PVM. Furthermore, developers often
rely on legacy software systems which are not compatible with the new
infrastructure.

One solution to the problem is to build a metacomputing infrastructure which
acts as a global batch system. The application developer builds his or her
“usual” program and gives it to the infrastructure. The infrastructure transfers
the program to free computation resources, executes it there, and returns the
results. The disadvantage of this solution is that only the computation resources
of one computation domain can be used for the application.

Often an application can be broken into several sub-applications which can
be executed concurrently in different domains. Using the global batch system
approach, the application developer has to do the distribution and coordination
of the sub-applications himself. Additionally, it is hard to implement two
concurrent sub-applications which exchange information.

Therefore, this paper introduces a different approach. An abstract data
object stores the executable and potentially the source code of a parallel MPI
sub-application. The application developer can then use a service type to
execute this sub-application on the computer with the least load in a
location-transparent way. This service type is encapsulated in a component which
can be glued together with other components to form a metacomputing application.

This paper is structured as follows. In the next section the metacomputing
infrastructure Amica is introduced, which serves as a testbed for our approach.
In Sec. 3 the integration of the MPI support into the infrastructure is described.
Then, this approach is compared to related work. Finally, the last section gives
conclusions and outlines future work.

2 Overview of Amica

Amica^1 is an experimental metacomputing infrastructure for researching some new
approaches. Applications are built from predefined components and connectors
by adapting them and gluing them together. This is done in the architecture
description language Acme. An introduction to this programming model is
provided in the next section.

An application is compiled into a light-weight code format. An interpreter
executes this code as a distributed application using the computation and
storage resources of the metacomputer. The computation resources are modeled as
service providers. This will be described in Sec. 2.2. An application accesses
storage resources using abstract data objects, which are only accessible using CORBA
interfaces. For a more detailed description see p[). 

2.1 Structure and Semantics of Applications 

In the programming model of Amica, applications are built out of components
attached to connectors. Components represent application data and application
functionality while connectors represent data and control flow. Figure 1 shows a
visualization of an example application which performs concurrent simulations.

Components are depicted as rectangles with their type printed in the top 
segment and internal parameters in the bottom. They have named ports which 
can be used for data access. Connectors glue components together. They are 
depicted as oval boxes. Bold arrows specify the control flow while the dotted 
lines specify the flow of data. 

The example application starts with the activation of a UserBrick, a com- 
ponent for the integration of Java code. This allows support for graphical user 
interfaces and formatting of application data so that it fits the needs of the com- 
putation services. In the application a class is instantiated to load and edit an 
executable file which performs the simulation. 

Components representing data objects have two important properties: a type 
and a name. The executable file is stored in a data object named simulationExe. 

^1 Abstract Metacomputing Infrastructure for Coarse Grained Applications

Fig. 1. Graphical representation of an example application written in Acme that
performs concurrent simulation runs



All data objects have four ports for read and write access and for creation and 
deletion. If the creation port is not connected the data object is automatically 
created at startup time of the application. 

When the first UserBrick terminates, another UserBrick containing the Java
class CreateScenarios is activated, which loads some scenarios for simulation
into a data object named Scenarios with the type Map[Binaries]. This means
each scenario is stored as a sequence of bytes under a unique key.

In the next step a farm connector is activated. For every element of the 
map connected to its port IM, it creates a replicate of its worker. A worker is 
a component itself with an internal representation consisting of an arbitrary 
number of connected components. Bindings connect the internal representation 
to the application by propagating the data flow. 
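
The behaviour of the farm connector can be sketched as a map over the entries of the
input data object. The function names farm and simulate are hypothetical, a thread
pool stands in for the distributed execution that Amica would provide, and the worker
is a placeholder rather than a real MetaBrick.

```python
from concurrent.futures import ThreadPoolExecutor


def farm(input_map, worker):
    """Start one replica of `worker` per map entry and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {key: pool.submit(worker, value)
                   for key, value in input_map.items()}
        return {key: future.result() for key, future in futures.items()}


def simulate(scenario: bytes) -> bytes:
    # Placeholder worker; a real worker would wrap a MetaBrick.
    return scenario[::-1]


scenarios = {"s1": b"abc", "s2": b"xyz"}  # plays the role of Map[Binaries]
results = farm(scenarios, simulate)
print(results)  # {'s1': b'cba', 's2': b'zyx'}
```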

In the example, the worker consists of only one MetaBrick. This is a component
which provides access to the computation resources. Its functionality
is described in the following section. It is parameterized with the service type
execution and with attachments to the data objects holding the input data, the
output data and the executable. The actual semantics of this special service
type and the format of the attached data objects are described in Sec. 3.

Now for every scenario a simulation is computed and the results are stored in
a data object named Results. A UserBrick is activated to visualize the results
and, finally, the application terminates.

2.2 Computation Subsystem

As shown in the previous section, the computation resources are modeled as
service providers. In the Amica infrastructure there are three main classes to
support distributed computing:

— A brick factory can generate a service provider for a given service type and a 
computer architecture. This can be done in arbitrary ways, e.g., by compiling 
a substituted source code template or simply by using a library. 

— A brick provides a service for a fixed service type and computer architecture. 

— A computation unit manages the actual resources. It is the only instance 
which provides access to the resources. Additionally it gives information 
about the current load. 

Figure 2 shows the interaction of these classes. A MetaBrick directs a brick
factory to generate a brick. After generation, the brick is registered at the
computation unit along with its demands on resources. Demands are specified as a
potentially open interval (minimum, maximum) of processors and the need for
disk space.



[Sequence diagram. Participants: :MetaBrick, :brick factory, b:brick, :computation
unit. Messages: activation, create brick, build, register(b, demands),
activate(resources), computation, deliver results, notify, free resources.]

Fig. 2. Interaction of the main computation classes



When the resources are available and a scheduling strategy chooses this brick 
to run, it is activated. With this activation comes a specification of resources 
the brick can use for its computation. The format of this specification depends 
on the concrete environment. For a workstation cluster, the Internet addresses 
of the assigned workstations are transmitted plus an NFS directory. After the 
computation the results are delivered to the MetaBrick and the computation 
unit is notified that the used resources are free. 
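
The interaction between the three classes can be sketched as follows. This is a
simplified, sequential model: the class and method names mirror the labels of Fig. 2
but are otherwise hypothetical, the scheduling strategy (first come, first served) is
an assumption, and disk space demands are left out.

```python
class Brick:
    """Provides one service for a fixed service type and architecture."""

    def __init__(self, service_type, arch, run):
        self.service_type, self.arch, self.run = service_type, arch, run


class BrickFactory:
    """Generates bricks from a table of known implementations."""

    def __init__(self, implementations):
        self.implementations = implementations  # (service, arch) -> callable

    def create_brick(self, service_type, arch):
        return Brick(service_type, arch, self.implementations[(service_type, arch)])


class ComputationUnit:
    """Owns the processors and decides when a registered brick may run."""

    def __init__(self, processors):
        self.free = processors
        self.waiting = []

    def register(self, brick, min_procs, max_procs, deliver):
        # max_procs=None models an open interval (no upper bound).
        self.waiting.append((brick, min_procs, max_procs, deliver))

    def schedule(self):
        still_waiting = []
        for brick, lo, hi, deliver in self.waiting:
            if self.free < lo:
                still_waiting.append((brick, lo, hi, deliver))
                continue
            granted = self.free if hi is None else min(self.free, hi)
            self.free -= granted         # activate(resources)
            deliver(brick.run(granted))  # computation, deliver results
            self.free += granted         # notify: resources are free again
        self.waiting = still_waiting


factory = BrickFactory({("execution", "x86"): lambda n: f"ran on {n} processors"})
unit = ComputationUnit(processors=4)
results = []
unit.register(factory.create_brick("execution", "x86"), 2, 3, results.append)
unit.register(factory.create_brick("execution", "x86"), 8, None, results.append)
unit.schedule()
print(results)            # ['ran on 3 processors']
print(len(unit.waiting))  # 1, the second brick needs more processors than exist
```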

3 Integration of MPI Support 

As stated in the introduction, many parallel applications are developed using
portable infrastructures for local-area networks such as MPI or PVM. In the
following these applications are called sub-applications.

To investigate how these sub-applications can be smoothly integrated in a
metacomputing framework, we designed a general framework but concentrated
on MPI for a practical evaluation of the concept. The integration is based upon
the introduction of a new service type for executing sub-applications and a new
data object type for storing them.

The sub-application is given as source code. For execution on a given platform
it is automatically configured, compiled, and finally executed. Because
configuration and compilation depend strongly on the application, scripts provided
by the application developer are used. If the sub-application was already compiled
for the given platform, the old binary code can be reused to save the compilation
and link time. Input and output data are automatically transferred to the
sub-application.

Figure 3 shows the CORBA interface ExecutableDO of the new data object
type Executable introduced to store executable binaries along with their source
code. Two binary types exist, specified by the union BinaryType. One is for
single, directly executable programs and the other is for archives containing
several programs. Archives are first unpacked and then the program with the
name mainProg is started.

The structure Executable consists of a binary type, a binary executable as 
sequence of bytes, and the required run time support. Currently only MPI is sup- 
ported as special runtime environment. For every computer architecture, speci- 
fied by the structure CUArchitecture, an executable structure can be stored. 

The source code is stored as an attribute of type SourceCode. It contains an
archive, i.e., a gzipped tar file, the commands to build the executable from the
archive and to configure the build process, and the type and required runtime
support of the produced executable. Figure 4 shows a user interface to edit an
executable data object. This user interface is also used by the example application
in Fig. 1 to edit the executable data object.

To make the executables available to an application, the service type execution
is added to Amica. Figure 5 shows a flowchart specifying the semantics of this
new service type using the elements of an executable data object. If an executable
exists for a given architecture it is put into the local file system. If it is an
archive it is first unpacked. Then the input data is copied from the data objects to
the file system and the program is started. After termination the results are copied
to the data objects and the file system is cleared. In addition to files it is also
possible to use the standard input and output streams as interfaces to the
executable.

If the executable does not exist but source code is provided, the executable is
built using a configuration and a compilation phase. The result is inserted into
the executable data object and executed as described above. If the build process
or the execution of the MPI sub-application fails, the locally used resources are
freed and the execution is retried on another computation resource. Therefore
failures of the MPI system do not affect the Amica application.
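
The control flow of the execution service type described above (reuse a stored binary,
otherwise build and cache it, and retry elsewhere on failure) can be sketched as
follows. The dictionary layout of the data object and the build and run callbacks are
hypothetical placeholders for the CORBA data object and for the real configure,
compile and start-up steps.

```python
def execute(exe_do, units, build, run):
    """Try the computation units in turn until one run succeeds."""
    for name, arch in units:
        try:
            binary = exe_do["executables"].get(arch)
            if binary is None:
                if exe_do["source"] is None:
                    raise RuntimeError("nothing to run for " + arch)
                binary = build(exe_do["source"], arch)  # configure + compile
                exe_do["executables"][arch] = binary    # store for later reuse
            return run(binary, name)
        except RuntimeError:
            continue  # local resources are freed, retry on the next unit
    raise RuntimeError("all computation units failed")


builds = []


def build(source, arch):
    builds.append(arch)
    return f"{source}@{arch}"


def run(binary, unit):
    if unit == "flaky":
        raise RuntimeError("MPI start-up failed")
    return f"{binary} ran on {unit}"


exe_do = {"executables": {}, "source": "sim.tar.gz"}
units = [("flaky", "x86"), ("stable", "x86")]
print(execute(exe_do, units, build, run))  # sim.tar.gz@x86 ran on stable
print(execute(exe_do, units, build, run))  # same result, no second build
print(builds)                              # ['x86']
```

The second call shows the efficiency argument made in the text: the binary built
during the first call is found in the data object and compilation is skipped.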

interface ExecutableDO : DataObject {

  /* type of the executable binary */
  union BinaryType switch (char)
  { case 's': boolean single;  /* empty union elements are invalid */
    case 'a': struct BinaryArchive { string mainProg; } archive;
  };

  enum RuntimeSupport {NONE, MPI};

  struct Executable
  { BinaryType binaryType;
    OctetSeq binary;
    RuntimeSupport runtimeSupport; };

  struct SourceCode
  { string compileCommand;     /* build executable */
    string configureCommand;   /* configure build process */
    string execFileName;       /* name of the executable */
    BinaryType execType;       /* type of the executable */
    RuntimeSupport execRuntSupp;
    OctetSeq archive; };

  attribute SourceCode sourceCode;

  /* returns true iff a binary exists for a given architecture. */
  boolean existsForArchitecture(in computation::CUArchitecture arch);

  /* adds a new executable. */
  void addExecutable(in computation::CUArchitecture arch,
                     in Executable prog);

  /* returns an executable for a given architecture. */
  Executable getExecutable(in computation::CUArchitecture arch);
};

Fig. 3. The CORBA interface for data objects of type Executable

Fig. 4. Screen shots of the user interface for editing an executable data object

Fig. 5. A flowchart specifying the usage of the executable data object

We developed two implementations of this service type in Java. One
utilizes multiprocessor workstations with shared memory, while the other utilizes
clusters of multiprocessor workstations. The MPI runtime support is built on
MPICH.

To measure the management overhead we built an application that compiles
and starts a small C program (about 5 KB). The Amica infrastructure and the
application were started on the same computer (Pentium III, 450 MHz). Omitting
the compilation time, it took 529 ms from the activation of the MetaBrick to
the start of the compiled program. This is clearly a negligible amount of time if
the sub-application runs for at least one minute.
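
The size of this overhead relative to the run time can be checked with a one-line
calculation. The function name is hypothetical; the 529 ms figure is the measurement
reported above, and the run times are example values.

```python
def overhead_fraction(run_time_s, overhead_s=0.529):
    # Share of the total wall-clock time spent on management overhead.
    return overhead_s / (overhead_s + run_time_s)


print(f"{overhead_fraction(60):.2%}")    # 0.87% for a one-minute run
print(f"{overhead_fraction(3600):.3%}")  # 0.015% for a one-hour run
```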



4 Related Work 

WebFlow provides a high-level programming environment for the Globus
system. Applications are graphically composed from components which
encapsulate services of the infrastructure. Currently, an in-depth knowledge of the
resource management system of Globus is needed for application development.
Additionally, the programming model concentrates on functional components
without providing a sophisticated data model. Therefore the transparent usage
of MPI components as data objects is not possible.

GIS uses data flow graphs instead of the more general task graphs. Unlike
Amica, GIS aims at the manipulation of geographical data and is specialized in
the manipulation of streaming data. It uses DISCWorld, which is also based
on service types. However, it does not support abstract data objects.

UNICORE provides abstract job objects which are represented as directed
acyclic graphs. The nodes resemble tasks of specific classes (e.g., compile task,
transfer task) and are thus more restricted than the general service nodes of
Amica. The integration and local execution of Java code is not supported by
UNICORE. Application data is explicitly loaded and transferred by special file
tasks. UNICORE offers the composition of an application from sub-applications,
such as MPI programs, but it uses a very explicit approach. The user must
perform load balancing and file transfers on his own, whereas our approach is
location-transparent.

Summarizing, all related approaches lack support for abstract data objects. 
This is essential for our approach because it accommodates transparency of 
location and heterogeneity by hiding dynamic compilation. Furthermore, it offers 
efficiency by automatically reusing precompiled code. 

5 Conclusions and Future Work 

This paper introduces a new approach to integrate general MPI applications 
into a metacomputing application using an intuitive programming model based 
on components and connectors. This approach facilitates reuse of programming 
knowledge and of legacy systems for the creation of wide-area distributed ap- 
plications. We have implemented a prototype of this approach using the Arnica 
metacomputing infrastructure. The prototype offers location transparency, het- 
erogeneity by dynamic compilation, and efficiency by reusing precompiled code. 

Currently, we are enhancing the infrastructure to support authentication and
accounting. We then plan to test our approach on high performance computers.
We are also working on the support of pipe parallelism in the programming model
to broaden the class of supported applications. After the programming model is
fixed, a visual editor will be implemented to ease application development.
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Abstract. We study the issue of interconnecting computer algebra sys- 
tem Maple and the message passing environment PVM. A prototype sys- 
tem, namely PVMaple, is presented. The system allows to create concur- 
rent tasks and have them executed by Maple kernels running on different 
machines of a network. 



1 Introduction 

Recent developments in parallel distributed systems have resulted in increased
use of suitable environments, like PVM, which make message-passing
programming of complex problems easier and faster for users. The PVM framework
allows users to write their applications as a collection of cooperating tasks.

Computer algebra systems (CAS) can be successfully used in prototyping
sequential algorithms for the symbolic or numeric solution of mathematical
problems. Maple, a widely used environment for scientific computing, is such a CAS.
Constructing prototypes for parallel algorithms in Maple is currently a challenging
problem.

Several attempts have been made to combine Maple with parallel or
distributed computation features. ||Maple||, developed at the beginning of the
1990s, is a portable system for parallel symbolic computations built as an
interface between the parallel declarative programming language Strand and the
sequential CAS Maple. Sugarbush combines the parallelism of C/Linda with
the Maple V Release 3 kernel. A port of the Maple kernel to the Intel Paragon
family of massively parallel distributed memory machines has also been described.
A number of Maple kernels running on different machines of a local network can
also communicate by a simple mechanism based on reading and writing shared files in
a global network file system. FoxBox provides an MPI-compliant distribution
mechanism that allows for parallel and distributed execution of FoxBox
programs; it has a client/server style interface to Maple. The most recent system,
Distributed Maple, is a portable system for writing parallel programs
in Maple, which allows concurrent tasks to be created and executed by
Maple kernels running on different machines of a network. The system can be
used in any network environment where Maple and Java are available. It provides
message passing facilities via a global heap.

We present a prototype system, Parallel Virtual Maple (PVMaple for short), for
studying the interconnection of PVM and Maple. Its aim
is to interface the flexible process and virtual machine control of the PVM
system with several Maple processes, thus allowing Maple applications
to inter-operate transparently across multiple heterogeneous hosts. It provides
facilities for post-execution analysis of the behavior of a session. The design
principles are very similar to those of Distributed Maple.

2 PVMaple Inside 

The PVMaple system extends the capabilities of Maple to distributed
computations on workstations grouped into a Parallel Virtual Machine. The user
interacts with the system via the text-oriented Maple front-end. PVMaple can
be used in any network environment where Maple and PVM are available. The
necessary packages are enumerated in Table 1.



Table 1. PVMaple components and dependencies

Component                Format   Machine   Restrictions
Package pvm.m            Maple    local     to be read before any pvm[ ] call
Command messenger        Binary   local     to be started after pvmd
File with binary paths   Text     all       in the same path
Maple V                  Binary   all       at least Release 4
Pvmd                     Binary   all       at least version 3.4



Figure 1 indicates the active processes on each machine of the PVMaple
system. The communication routes are represented by arrows. The command
messenger ensures the communication between the Maple processes via
the PVM environment. Initialization of a Parallel Virtual Maple session, creation
of Maple processes and inter-process communication are provided by the pvm.m
package.

Table 2 enumerates the functions and constants which are currently included
in the pvm.m package. Four separate issues are specifically addressed: process
start-up facilities, Maple process identifiers, transparent message passing, and
post-execution analysis of Maple session behaviour. A Maple process is identified
by a pair [machine_name, process_id]. A process can communicate with
another one by using its pair and the package's function calls pvm[send]
and pvm[receive]. Note that the Maple processes cannot communicate directly. A
Maple send command is registered by the associated command messenger, which
sends a message (via the pvmds) to the destination command messenger, which
in turn informs the associated Maple process about the incoming message.
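
The relay described above can be illustrated with a minimal Python sketch. It is not the PVMaple implementation: the class name `CommandMessenger`, the dictionary standing in for the pvmd layer, and the in-memory inbox are all assumptions made for illustration only.

```python
from collections import deque


class CommandMessenger:
    """Hypothetical stand-in for the command messenger attached to one Maple process."""

    def __init__(self, address, network):
        self.address = address      # pair (machine_name, process_id)
        self.network = network      # dict address -> messenger; models the pvmd layer
        self.inbox = deque()        # messages waiting for the local Maple process
        network[address] = self

    def send(self, destination, message):
        # The Maple process never contacts its peer directly: its messenger
        # hands the message to the messenger of the destination process.
        self.network[destination].deliver(self.address, message)

    def deliver(self, source, message):
        # Store the incoming message until the local Maple process asks for it.
        self.inbox.append((source, message))

    def receive(self):
        # Return the oldest pending (source, message) pair, or None if idle.
        return self.inbox.popleft() if self.inbox else None


network = {}
local = CommandMessenger(("local", 1), network)
dana = CommandMessenger(("dana", 1), network)
local.send(("dana", 1), "S:=5;")
result = dana.receive()
```

In this sketch the only path between two processes runs through their messengers, which mirrors the statement that Maple processes cannot communicate directly.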

Figure 2 shows the time sequence in which the processes are activated when
PVMaple is started and also the effects of different commands from the pvm.m
package (via pvmd and the PVM functions implemented in the command messenger).
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Fig. 1. Processes and inter-communication routes 



Table 2. Functions and constants

Function  Synopsis                              Meaning/Parameters
spawn     pvm[spawn](sequence)                  create Maple processes;
                                                sequence is a sequence of elements
                                                like [station_name, processes_no]
send      int := pvm[send](destin, message)     send Maple commands;
                                                int is the message identifier,
                                                destin can be `all` for all processes
                                                or [station_name, process_id],
                                                message: a string with Maple commands
receive   list := pvm[receive](mesid, source)   receive process results;
                                                list stores the results returned by each process,
                                                mesid is the int given by a send or `all`,
                                                source: `all` for all processes
                                                or [station_name, process_id]
exit      pvm[exit]()                           return a success or fail message
settime   pvm[settime]()                        start time registration
time      pvm[time]()                           show a time graphic
version   pvm[version]()                        version message

Constant  Synopsis                  Meaning
ProcId    integer := pvm[ProcId]    process identifier on a particular station
MachId    integer := pvm[MachId]    station number in the virtual machine
TaskId    integer := pvm[TaskId]    process identifier in the virtual machine
Tasks     list := pvm[tasks]        list of process identifiers [station_name, process_id]



In Distributed Maple, communication between the Java scheduler and the
Maple processes is based on pipes; the corresponding Unix functions are not
available under Microsoft Windows (a Maple V Release 4 problem). In the PVMaple
package currently available for PCs running Microsoft Windows, communication
between Maple and its associated command messenger uses text files.

[Figure: time sequence of the Maple and command messenger processes on the local processor, processor 1 and processor 2]

Fig. 2. Communications and components inter-dependencies



3 Examples 

In this section we evaluate the behaviour of our prototype. The tests were
run in a network composed of five identical 250 MHz Pentium PCs with 64 MB of
RAM running Windows '95. The PVM 3.4 version for Windows '95 was used
for message passing.

Table 3 presents three examples of command sequences written in Maple
using the pvm.m package. The first one shows two kinds of message send
command: between the main Maple process (local process) and a remote Maple
process (located on machine dana), and between a remote Maple process (located
on machine dana) and another remote Maple process (located on machine paula).
The Maple command sequences which will be executed on remote processes are
those included between pairs of back quotes (Maple commands starting with M
and N). The graphical result of the pvm[time] command is presented in Figure 3.
The active time periods of Maple processes and auxiliary command messengers
are indicated by horizontal lines. The total active time (between a pvm[settime]
and a pvm[exit] command) of a Maple process is expressed in percent, so that
the user can easily estimate the load balance of the distributed algorithm, the
delays and the sources of idleness.

Table 3. Examples of distributed commands written in Maple

Example 1: communication between processes

> restart; read `pvm.m`;                           load the PVMaple functions
> pvm[settime]();                                  start time chronometer
> pvm[spawn]([dana,1],[paula,1]);                  start two tasks
Tasks: [[`local`, 1], [dana, 1], [paula, 1]]       result: pvm[tasks] values
> mess1:=`pvm[send]([paula,1], "global S;          prepare a message to be sent
  S:=2"); S:=5;`:                                  and finish with an assignment
> M:=pvm[send]([dana,1],mess1):                    send command to the first task
  pvm[receive](M,[dana,1]);                        receive last command's result
[5]                                                1st task last command's result
> mess2:=`S:=3; pvm[receive]([dana,1],`all`);      prepare receive from task 1
  S+10;`;                                          and finish with an addition
> N:=pvm[send]([paula,1],mess2):                   send command to the second task
  pvm[receive](N,[paula,1]);                       receive last command's result
[12]                                               2nd task last command's result
> pvm[exit]();                                     kill the tasks and stop the time
PVMaple quitted                                    PVMaple stopped
> pvm[time]();                                     show the time graph

Example 2: matrix-vector multiplication

> restart: with(linalg): read `pvm.m`: pvm[settime]():
> pvm[spawn]([dana,1],[sica,1],[bubu,1],[paula,1]):
Tasks: [[`local`, 1], [dana, 1], [sica, 1], [bubu, 1], [paula, 1]]
> # pvm[spawn]([dana,4]): variant for 4 tasks on the same processor
> # prepare the matrix and vector
> n:=500: p:=5: r:=n/p: x:=randmatrix(n,1): X:=mat2str(x):
> # prepare the commands for remote tasks
> message:=`with(linalg): r:=`.r.`: n:=`.n.`: x:=`.X.`:`:
> message:=``.message.` A:=randmatrix(r,n): multiply(A,x);`:
> # send the commands to all remote tasks
> s:=pvm[send](`all`, message):
> # compute the local part of the product
> z:=array(1..n): A:=randmatrix(r,n): u:=multiply(A,x):
> # receive the remote parts of the final vector
> v:=pvm[receive](s,`all`):
> for i from 1 to n-r do z[i]:=v[iquo(i-1,r)+1][irem(i-1,r)+1,1] od:
> for i from n-r+1 to n do z[i]:=u[i-n+r,1] od:
> pvm[exit](): pvm[time]();

Example 3: parallel ODE method implementation (one iterative step)

> parallel_Runge_Kutta_method_onestep_pvm:=proc(y,h,t) local message,L,R;
> for i to q do                                    number of stages
> message:=prepare_fsolve_stage(h,y,n,t,i,L,R):
> S[i]:=pvm[send]([`all`,1], message):             to all p(i) - 1 processors
> L[i]:=fsolve(message[1]);                        solve local stages
> R[i]:=pvm[receive](S,[`all`,1]); od:             receive remote task results
> K:=assembly(R,S); y:=update(y,K); end:           new solution approximation

[Figure: time diagram showing the active periods of the command messenger and the Maple process on paula, dana and local; horizontal axis: time (seconds)]

Fig. 3. Result of pvm[time] command from Example 1, Table 3

Example 2 from Table 3 presents a distributed variant of a square matrix-vector
multiplication. A randomly generated 500-dimensional vector x, sent to
five different processors, is multiplied by local randomly generated 100x500
matrices, and the multiplication results are assembled into a final 500-dimensional
vector z. The high usage percentages reported in Figure 4(a) as a result of the settime
command indicate that an efficient implementation of a distributed multiplication
algorithm is possible. Indeed, comparing the time necessary to obtain the result of
the multiply(A, x) command with A a 500x500 matrix and the time depicted in
Figure 4(a) (between the send commands and exit), we get a speed-up of 4.02 (using
5 processors).
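
The block-row scheme and the index mapping used to assemble z can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of the PVMaple session: a single matrix A is split into row blocks so that the result can be checked against a sequential product, the sizes are reduced to n = 10 and p = 5, and the helper names `matvec`, `assemble` and `efficiency` are hypothetical.

```python
import random


def matvec(A, x):
    """Plain row-by-row matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def assemble(v, u, n, r):
    """Rebuild z from the remote parts v and the local part u (0-based indices).

    Mirrors the two Maple loops: the first n-r entries come from the remote
    results, the last r entries from the locally computed block.
    """
    z = [0] * n
    for i in range(n - r):
        z[i] = v[i // r][i % r]
    for i in range(n - r, n):
        z[i] = u[i - (n - r)]
    return z


def efficiency(speedup, processors):
    """Parallel efficiency: speed-up divided by the number of processors."""
    return speedup / processors


rng = random.Random(0)
n, p = 10, 5
r = n // p
x = [rng.randint(-9, 9) for _ in range(n)]
A = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
blocks = [A[k * r:(k + 1) * r] for k in range(p)]
v = [matvec(block, x) for block in blocks[:-1]]   # work of the p-1 "remote" tasks
u = matvec(blocks[-1], x)                         # work of the local task
z = assemble(v, u, n, r)
```

With the figures reported above, `efficiency(4.02, 5)` gives a parallel efficiency of roughly 0.80.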

The third sequence of commands presented in Table 3 is part of a larger
program generating numerical solutions of initial value problems for ordinary
differential equations of the form $y'(t) = f(t, y(t))$, $y(t_0) = y_0$,
$f : [t_0, t_0 + T] \times \mathbb{R}^n \to \mathbb{R}^n$. Parallel solving strategies
for such a problem have been described elsewhere. One step of a particular solving
strategy, namely a parallel Runge-Kutta method (an iterative method), involves the
solution of a nonlinear system in the unknown n-dimensional vectors $k_j$,
$j = 1, \dots, s$, computed in $q < s$ stages using $p(i)$ processors in stage $i$,
$i = 1, \dots, q$ (ideally $s = qp$).
Maple's function fsolve is used in this case. When n is large, the inter-process
communication time is insignificant relative to the time required by the fsolve calls.
Figure 4(b) indicates how the time is spent in applying four steps of a particular
Runge-Kutta method, Hammer's method, to a nonlinear initial value problem
(a convection-diffusion problem) with n = 40 differential equations. The
speed-up in this case is 1.59 (2 processors). More details about ODE integration
using PVMaple are presented elsewhere.

Fig. 4. Time diagrams and Maple's usage percent for (a) matrix-vector
multiplication ($n \times n$ matrix and n-dimensional vectors, with n = 500) and PVMaple
starting overhead (Example 2 from Table 3), (b) four steps of Hammer's method
applied to an initial value problem of dimension n = 40 and PVMaple stopping
overhead (Example 3)



4 Proposed Improvements and Perspectives 

PVMaple does not pretend to be better or more complex than the tools mentioned
in the first section. It is designed for the common user who cannot afford to
buy a parallel computer or another high-performance computer, but who has
access to a workstation of a local network (a PC, for example) with Maple, and
who wants to solve a problem requiring a large amount of computer resources.
The system is intended to be a public domain tool (unlike ||Maple||). Based on PVM
functions, it is faster than systems using shared files for communication.
Like Distributed Maple, it provides an environment where parallel programming is
possible within Maple.

PVMaple will soon be ported to Unix platforms. The command messenger,
written in C, must be recompiled when it is ported to a new operating system.
The next step consists of extending the system by introducing new functions with
equivalents in the PVM library. We estimate that by the end of this year a beta
version of PVMaple will be available for free download from the author's web
page. Future implementation activities will follow two directions: experiments
with applications of PVMaple and rewriting the command messenger for interfaces
with Mathematica and Matlab. Apart from extending PVMaple, our plans for
the future include prototyping parallel algorithms for initial value problems (project
D-NODE, based on Distributed Maple and also PVMaple).

5 Conclusions 

This document described the basic concepts behind the Parallel Virtual Maple 
package for cooperative work of Maple processes on networks. We showed that 
by mapping Maple onto PVM we can get an efficient and extensible environ- 
ment. This was realized through the use of PVM inter-communications which are 
handles to Maple inter-communications. The current prototyped system proved 
its usefulness from the software’s point of view in solving large mathematical 
problems. Due to its relatively low communication/computation ratio it can be 
implemented in local networks. Large sets of small (PC class) workstations can 
be used as a virtual machine with quite high computational power. 

Acknowledgments. The author would like to express her appreciation to Wolf- 
gang Schreiner, the creator of Distributed Maple, and to thank him for the fruitful 
discussions and precious references. 
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Abstract. PC clusters are still more popular platform for high per- 
formance computing. But there is still lack of freely available tools for 
resource monitoring and management usable for efficient workload
distribution. In this paper, a monitoring system for PC clusters called Cluster
Information Service (CIS) is described. Its purpose is to provide clients
(a resource management system or application scheduler) with
information about the availability of resources in a PC cluster. This information can
help the scheduler to improve the performance of parallel applications. CIS
is designed to have the lowest possible intrusiveness while keeping a high
level of detail in the monitoring data.

The possibility of improving the performance of PVM/MPI applications 
is also discussed. 



1 Introduction 

PC clusters are becoming ever more popular, mainly because they can deliver
supercomputer performance at far lower cost than commercial supercomputers. Unlike
clusters of workstations (CoW), computing nodes in PC clusters often consist
only of a main board, CPU, memory and network card, which dramatically
reduces the price/performance ratio. Moreover, the whole cluster acts as a single
entity (from the management point of view); therefore node availability does not
depend on node owners and is thus more predictable.

Most existing PC clusters are based on the Linux operating system, and
clustering is done using message passing environments, usually MPI or PVM. Although
more and more software is freely available, there is still a lack of tools for resource
monitoring and management. For embarrassingly parallel applications it is
possible to use some of the available queuing systems with integrated load balancing.
However, for communication-intensive applications it is very important to balance
the network load as well. Since the dynamic optimization problem is NP-hard, many
different heuristics have been proposed. They are mostly designed for a special type of
parallel algorithm and they have various needs for information about resources.
While a random task placement strategy uses no information at all, preemptive
methods with process migration need detailed information about each
process (size, communication dependencies, CPU requirements) as well as
information about the capacity of network links and, of course, an estimated node
performance.

* This work was supported by the Slovak Scientific Grant Agency within Research
Project No.2/7186/20

To minimize the impact of wrong decisions, the scheduler must predict the
availability of resources in the future. For this, it needs to know not only the current
state but also dynamic changes in the past. This imposes high requirements on
the monitoring system. In general, the level of information detail depends on the
intrusiveness of monitoring.

In this paper, a low-intrusive monitoring system for PC clusters is presented.
The next section gives an overview of its design and features. Section 3 is
devoted to short-term CPU performance prediction, built into the information server.
Resource management extensions of the PVM environment as well as the
possibility of using monitoring information in PVM/MPI applications are addressed
in Section 4. Section 5 gives a brief description of other monitoring systems
usable in PC clusters. Finally, Section 6 concludes the paper.

2 Cluster Information Service 

As mentioned above, the detail of monitoring data depends on the intrusiveness of
the monitoring system. Thus, the main design goal of CIS was to reduce the overhead
of monitoring as much as possible, mainly through its structure, data transfer and data
acquisition techniques.

The architecture of high performance PC clusters used for scientific
computations consists, in general, of a set of computing nodes connected by a high-speed
network and one or more front-end nodes usually not used for computations.
The CIS structure follows this architecture (Fig. 1). The CIS server runs on a head
node and collects information from monitors running on the computing nodes.




Fig. 1. Structure of Cluster Information Service 



Monitors are designed to be as simple as possible. They check monitored
objects at regular time intervals and inform the server about the changes. The
overhead of data acquisition is reduced by using special kernel probes instead of the standard
(textual) kernel interface. Probes provide monitors with monitoring data in
binary form and for all objects at once. Moreover, they are able to detect monitored



CIS - A Monitoring System for PC Clusters 227 



events and notify the monitor. The events (object creation/termination) are sent
to the server immediately, thus reducing server data inconsistency.
Instrumenting the OS kernel also makes it possible to measure the activity of communication
endpoints (sockets). This information is essential for the identification of
communication dependencies between tasks. The messages from monitors are sent using
the UDP protocol (Fig. 2), reducing the overhead of data transmission.



[Figure: workstations, cluster and computing nodes connected by RPC and UDP links; legend: client requests, monitoring data, control data]

Fig. 2. Data transfer



Clients can obtain monitoring information from the CIS server via RPC calls.
Unlike in most fully distributed monitoring systems designed for CoWs, the
requests are not propagated to monitors. The CIS server keeps an up-to-date view of the
whole cluster and provides clients with the requested information. Therefore, the load
imposed on the internal network does not depend on the frequency of client requests.
It depends only on the monitoring interval, the number of objects and their activity.
The system administrator can adjust the monitoring interval and thereby
control the level of intrusiveness of the monitoring system. This structure was selected
with a view to the possible integration of clusters into large computational grids, where
the number of requests for information about resource availability can be quite
high. Client RPC calls are encapsulated in an API library included in the CIS package.
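
The decoupling of client requests from internal network traffic can be sketched as follows. This is a minimal Python model under stated assumptions, not the CIS code: the class names `CisServerSketch` and `MonitorSketch`, the dictionary-based view and the update counter are hypothetical, and UDP and RPC transport are replaced by direct method calls.

```python
class CisServerSketch:
    """Hypothetical server that answers clients from a cached cluster view."""

    def __init__(self):
        self.view = {}              # (node, object_id) -> latest reported value
        self.updates_received = 0   # messages that crossed the internal network

    def on_monitor_update(self, node, changes):
        # Called once per monitor message; only changed objects are included.
        self.updates_received += 1
        for object_id, value in changes.items():
            self.view[(node, object_id)] = value

    def query(self, node):
        # Client requests are served locally and never reach the monitors.
        return {oid: val for (n, oid), val in self.view.items() if n == node}


class MonitorSketch:
    """Hypothetical monitor that reports only what changed since the last tick."""

    def __init__(self, node, server):
        self.node = node
        self.server = server
        self.last = {}

    def tick(self, sample):
        changes = {k: v for k, v in sample.items() if self.last.get(k) != v}
        if changes:
            self.server.on_monitor_update(self.node, changes)
        self.last = dict(sample)


server = CisServerSketch()
monitor = MonitorSketch("node1", server)
monitor.tick({"load": 0.5, "mem": 100})
monitor.tick({"load": 0.5, "mem": 100})   # nothing changed, nothing is sent
monitor.tick({"load": 0.7, "mem": 100})   # only "load" is sent
answers = [server.query("node1") for _ in range(1000)]
```

However many times `query` is called, `updates_received` is driven only by the monitoring interval and by how often the monitored objects change.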

The information provided by CIS is listed in Tab. 1.

Table 1. The information provided by CIS

system            memory, swap, number of processes, average load, CPU availability
processes         identification, owner, priority, start time, used memory, CPU usage,
                  disk transfer rates
sockets           source and destination addresses, owner, transfer rates
network devices   name, status, transfer rates, collisions

Monitoring information can be viewed using a visual client called xcis (Fig. 3).
Like the top utility, it provides the user with a continuously updated view of monitored objects.
One instance of xcis running on a user workstation is able to display the data
from multiple CIS servers. In the host list folder with system information, the user
can select the hosts to be viewed in the other folders (processes, communication
links, network devices, CPU availability and network usage history).




Fig. 3. Xcis screenshots 



Monitoring information can be saved to a file for later processing, either using
xcis or a background daemon. Archived information can be used for long-term
analysis of the dynamic behavior of parallel applications. The CIS package also includes
tools for processing record files (cutting, merging, printing in text form). And,
of course, xcis is able to visualize the data from the record files.

The overhead of CIS monitors measured on a Pentium III 550 MHz with 384
MB of RAM, a 100 Mbit network interface and the monitoring interval set to one
second is shown in Tab. 2. For comparison, the table also contains the overhead
of the top utility.

In the empty state there were 13 system processes on the node. In the loaded
state there were 10 additional user processes computing and communicating in
a cycle. Since the standard Linux kernel is not able to count time slices shorter than
one clock tick (usually 10 ms) and the monitors were consuming less than one tick,
it was necessary to modify the time accounting code in the kernel for correct
measurement.

Table 2. The overhead of CIS monitors and the top utility

         interval   sysmon         procmon        sockmon        netmon         top
state    (s)        %CPU    B/s    %CPU    B/s    %CPU    B/s    %CPU    B/s    %CPU    B/s
empty    1          0.01    67     0.01    49     <0.01   119    <0.01   46     1.5     1743
loaded   1          0.01    75     0.01    96     <0.01   272    <0.01   45     1.66    1647
empty    0.1        0.11    750    0.13    491    0.05    1289   0.05    402    -       -
loaded   0.1        0.12    750    0.18    904    0.08    2021   0.06    402    -       -

3 CPU Performance Prediction 

The main goal of an application scheduler is to achieve the shortest possible run time.
Since the resources in PC clusters are usually shared and their usage varies over
time, the scheduler must predict their availability in the future. Wrong decisions
may lead to performance degradation.

To estimate CPU availability, most schedulers use the load average
values provided by UNIX operating systems. They represent the average number of
processes waiting in the run queue over the last 1, 5 and 15 minutes. The main drawbacks
of the load average are its slow reaction to changes and its insensitivity to process
priorities. In the Network Weather Service, the authors tried to overcome the problem with
priorities by using a special probe process that measures real CPU availability for
tuning the estimation algorithm. This approach imposes additional overhead on
monitoring.
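
The slow reaction of the load average can be illustrated with a short Python sketch. It assumes the exponentially damped update commonly used by Linux-style kernels, with a 5-second sampling period; these constants and the function name `load_average_trace` are assumptions of this illustration and are not taken from the paper.

```python
import math


def load_average_trace(runnable_counts, window_s=60.0, sample_s=5.0):
    """Return the damped load average after each sample.

    runnable_counts: number of runnable processes observed at each sample.
    window_s: averaging window (60 s for the 1-minute value).
    sample_s: assumed sampling period of the kernel.
    """
    decay = math.exp(-sample_s / window_s)
    load, trace = 0.0, []
    for n in runnable_counts:
        # Old history is damped; the new observation enters with weight 1 - decay.
        load = load * decay + n * (1.0 - decay)
        trace.append(load)
    return trace


# One CPU-bound process starts on an idle node and runs for one minute.
trace = load_average_trace([1] * 12)
```

Under these assumptions the 1-minute value is still well below 1 after a full minute of one CPU-bound process, which is the kind of lag the paragraph above refers to.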

The CIS server contains a simple algorithm for short-term CPU performance
prediction, based on a simulation of the Linux process scheduler. Having information
about CPU usage for all processes, idle time and priorities, it can estimate how
much CPU time could be allocated to a new process (response time). The CPU
usage of a process is computed from the last three changes in order to also include
processes with a period longer than the monitoring interval. The algorithm for
estimating CPU availability is based on the elimination of processes that consume less
CPU time than they could obtain (non-CPU-bound processes). The rest of the available
CPU time is divided between the CPU-bound processes and a virtual new process
(according to their priority).
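
The elimination idea can be sketched in Python as shown below. The paper does not give the algorithm in detail, so this is only one possible reading: priorities are modelled as positive weights, CPU usage as a fraction of one CPU, and the function name `predict_cpu_share` is hypothetical.

```python
def predict_cpu_share(processes, new_weight=1.0):
    """Estimate the CPU fraction a new CPU-bound process could obtain.

    processes: list of (usage, weight) pairs, where usage is the measured CPU
    fraction in [0, 1] and weight is an assumed stand-in for priority.
    """
    available = 1.0
    candidates = list(processes)
    while True:
        total_weight = new_weight + sum(w for _, w in candidates)
        light, heavy = [], []
        for usage, weight in candidates:
            fair_share = available * weight / total_weight
            # A process using less than its fair share is not CPU-bound.
            (light if usage < fair_share else heavy).append((usage, weight))
        if not light:
            # Only CPU-bound processes remain: split what is left by weight.
            return available * new_weight / total_weight
        # Eliminate the non-CPU-bound processes and reserve what they consume.
        available -= sum(usage for usage, _ in light)
        candidates = heavy


# One CPU-bound process and one mostly idle process, all with equal weight.
share = predict_cpu_share([(1.0, 1.0), (0.1, 1.0)])
```

In this example the mostly idle process keeps its 10%, and the remaining 90% is split evenly between the CPU-bound process and the new one.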

The result can be used by an application scheduler, along with the maximal
performance the scheduled task can achieve, to estimate the current node performance. Since
process start/exit is reported immediately, the predicted CPU performance
reacts to these events immediately as well.

The mean error between the estimate and a benchmark (linpack), measured with multiple
CPU-bound processes with various priorities, was 1.73%, though for a more
realistic workload the error may be higher, especially if there is higher cache
contention.

A client can, however, obtain the information needed for prediction and predict
the performance itself. It is therefore possible to build an advisory system with
more sophisticated prediction techniques on top of CIS.

4 Dynamic Optimization of PVM and MPI Applications

CIS was developed as a part of a Dynamic Load Balancing (DLB) system for
PVM, mainly for dynamic process creation. To avoid multiplying the
overhead by having multiple applications collect the same information, we decided
to separate the monitoring subsystem and make it independent of PVM. Another
reason for doing this was to allow monitoring information to be used in data
parallel applications as well.

Our DLB system is based on a semi-distributed approach, in which the nodes in the
virtual machine are grouped into so-called spheres. Each sphere has one scheduler
that manages the sphere according to a centralized strategy and, when needed,
can transfer tasks to a less loaded sphere. At the level of spheres, load
balancing works on a fully distributed principle. This approach is a compromise
between centralized and fully distributed strategies and is well suited for
multi-cluster environments. Since PVM has no built-in support for transparent process
migration, our DLB system uses non-preemptive load balancing algorithms. In
the future, we plan to experiment with PVM extensions for process migration
(e.g. Condor, Dynamite or MOSIX).
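
A minimal Python sketch of the semi-distributed idea follows. It is an illustration under stated assumptions rather than the DLB system itself: the `Sphere` class, the `spawn` function, the task-count load metric and the overload threshold are all hypothetical, and placement happens only at task creation, in line with the non-preemptive approach.

```python
class Sphere:
    """Hypothetical group of nodes managed by one centralized scheduler."""

    def __init__(self, name, nodes):
        self.name = name
        self.tasks = {node: [] for node in nodes}

    def average_load(self):
        # Load is modelled as the number of tasks per node (an assumption).
        return sum(len(t) for t in self.tasks.values()) / len(self.tasks)

    def place(self, task):
        # Centralized decision inside the sphere: pick the least loaded node.
        node = min(self.tasks, key=lambda n: len(self.tasks[n]))
        self.tasks[node].append(task)
        return node


def spawn(task, home, spheres, threshold=2.0):
    """Place a new task in its home sphere, or in the least loaded sphere
    when the home sphere has reached the assumed overload threshold."""
    target = home
    if home.average_load() >= threshold:
        target = min(spheres, key=Sphere.average_load)
    return target.name, target.place(task)


a = Sphere("a", ["a1", "a2"])
b = Sphere("b", ["b1", "b2"])
placements = [spawn(f"task{i}", a, [a, b]) for i in range(5)]
```

Here the first four tasks stay in sphere a, and the fifth is handed over to sphere b once sphere a reaches the threshold.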

The implementation is based on the ability of PVM to forward requests
for spawning new tasks to a special process called a resource manager (an example
of such a plug-in process, called srm, is distributed along with the PVM source code).
The interconnection between our DLB system and CIS is shown in Fig. 4.




Fig. 4. Semi-distributed dynamic load balancing 



Srm uses a very simple load balancing algorithm based on the number of tasks, but
it is quite easy to enhance it with a more sophisticated strategy. A modified
(centralized) srm which distributes tasks according to the information provided by
the CIS server can be downloaded from the CIS homepage.

In MPI, process creation is different for each implementation, and the state of
their support for the MPI-2 process management functions is under examination.
However, scheduling using the information from CIS can be integrated into an
application (PVM or MPI).

5 Related Work

Most existing monitoring tools are designed for performance analysis of an
application (e.g. for the detection of bottlenecks). There are only a few monitoring
tools that can provide information for dynamic optimization.

Monitoring systems differ in purpose, architecture, provided information and
data acquisition mechanisms. A general model of a monitoring system for
distributed systems has been described in the literature. Node Status Reporter is a
system for monitoring heterogeneous workstations. It consists of a
set of daemons (one per host) communicating on the client-server principle. It
provides clients with static and dynamic information about hosts. PARMON is a commercial
object-oriented monitoring system that provides information about
hosts, processes, network devices and kernel activities. The
information is accessible via a Java interface. Another Java-based monitoring system
for large clusters of workstations is ClusterProbe. It is
designed to be open, flexible and scalable. Monitoring information can be accessed
through multiple adaptors (including CORBA, SQL and HTTP). Monitoring of
large computational grids is addressed in the Network Weather Service. Along
with monitoring of hosts and networks, it also contains performance forecasting
techniques.

The monitoring systems mentioned above are not (yet) freely available. One
of the freely available monitoring systems for PC clusters (apart from the very
simple bWatch, which displays the average load on nodes) is the SMILE Cluster
Management System m, which also contains a monitoring subsystem and an API for
accessing monitoring information. It provides quite rich information about
nodes, including network statistics. Its overhead is affected by the
client-server access to monitoring information on computing nodes. Monitoring
just CPU information, load average, memory and swap usage on the system
described in section 2, with the monitoring interval set to one second, has a
CPU overhead of around 1.75% and a network overhead of around 7 kB/s. This
overhead is multiplied by the number of clients.

In February 2000, SGI released the monitoring infrastructure of
"Performance Co-Pilot" as open source. Unfortunately, at the time of writing
this paper its visualization tools were not freely available.

CIS differs from the monitoring systems mentioned above in its structure and
data acquisition techniques. Using kernel instrumentation, it is also able to
monitor parameters not provided by the standard kernel interface. On the other
hand, CIS has no means for managing monitored objects.

6 Conclusions and Future Work 

The Cluster Information Service presented in this paper can provide application
schedulers with information about the availability of resources in a PC cluster.
The overhead of data acquisition is reduced using special probes in the OS
kernel. Unlike most other monitoring systems based on the client-server
principle, CIS provides continuous monitoring of computing nodes. Thus, its
intrusiveness is relatively stable. Knowing the history of CPU usage, the CIS
server can make short-term predictions of CPU availability.
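
The text does not say which prediction method the CIS server uses. As one
possible illustration only, the sketch below applies exponential smoothing to
a history of CPU availability samples; the function name, the smoothing factor
alpha and the sample values are assumptions, not part of CIS.

```python
def predict_cpu_availability(history, alpha=0.5):
    """Predict the next CPU availability value by exponential smoothing.

    history is a sequence of past availability samples (fractions in [0, 1]),
    oldest first. alpha in (0, 1] weights recent samples more heavily.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    estimate = history[0]
    for sample in history[1:]:
        # Blend the newest sample with the running estimate.
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


if __name__ == "__main__":
    # A node that was idle and has just become busy.
    print(predict_cpu_availability([1.0, 1.0, 0.0]))
```

A larger alpha reacts faster to load changes, while a smaller one smooths out
short bursts.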

The information provided by CIS can be useful not only for application
schedulers but also for a cluster management system or a system providing
Quality of Service (QoS).

By releasing it as open source (http://ups.savba.sk/parcom/cluster/cis.html),
we hope that it will evolve to best fit the requirements of a wide range of
users of PC clusters. In the future we plan to implement CORBA and HTTP
interfaces and to connect CIS to the information system of the Globus project |S].

Abstract. In this paper a new tool for monitoring the different queues
of messages in a PVM environment is presented. The main aim of implementing
this facility is to provide a means of capturing the bottlenecks
and overheads of the communication system in a PVM-Linux cluster.
It also makes it possible to determine the communication pattern of a
distributed application. Its correct behaviour has been verified experimentally.



1 Introduction 

Nowadays, one of the most important goals in distributed computing, and
especially in PVM [4] environments, is performance evaluation, as supported by
tools such as Paradyn [7], AIMS [10] and XPVM [6]. To study performance, some
questions must be answered, such as how good the message passing libraries of
the distributed environment are, or where there is room for improving their
performance.

We are interested in knowing what the relevant factors are and how far
they influence system performance, focusing the study on the
communication-related ones. With this purpose in mind, a monitoring tool named
Monito [9] was developed. It samples the state of the communication buffers
(composed of messages, fragments, packets and frames) of a PVM-Linux system,
from the top (PVM), through the kernel (sockets and logical network device), to
the bottom (physical network device).

The Linux /proc file system offers much information about the communication
subsystem, but this information is insufficient to obtain a global view
of its behaviour at each instant (bottlenecks, saturations, reasons for crashes
in distributed applications, and so on). The Monito tool was designed to
provide a means of investigating and localizing these phenomena. Other tools,
like netperf [2] and ParaGraph [5], give global statistical performance, but do
not provide information about the state of each communication buffer.

This paper is organized as follows. Section 2 describes the main buffers and
structures of the communication subsystem. Monito implementation and operation
details are presented in Section 3. In Section 4, Monito behaviour is
evaluated. Finally, the conclusions and future work are detailed.

* This work was supported by the CICYT under contract TIC98-0433 

Fig. 1. (left) PVM message structure (pmsg) and (right) sk_buff and packet
structures. The packet Data area contains the information to be transmitted.
The next and prev fields link sk_buff structures.



2 Analysis of the Communication System 

In this section the main queues involved in the communication process are ana- 
lyzed, from the PVM to the physical network device layer. 



2.1 PVM Layer 

PVM allows the execution of distributed applications in two different
communication modes: RouteDirect and DontRoute. In DontRoute mode, all
communication between tasks goes through the pvmd daemon: daemon-daemon
communication uses the UDP protocol, and task-daemon communication uses TCP or
the UNIX Domain protocol. In RouteDirect mode, on the other hand,
communication between remote tasks uses the TCP protocol directly.

The PVM transmission unit is the message (of variable length). Every
message has an associated pmsg structure, which is divided into fixed-length
fragments (4096 bytes each). Initially, a head fragment called master is
created; then, every time a fragment is filled up, a new one is initialized and
linked to the previous one, and so on. Fig. 1 (left) shows the structure of a
PVM message made up of a master fragment and two data fragments (the first is
full).
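
The fragmentation rule can be illustrated with a small sketch. It only models
the splitting of a payload into 4096-byte data fragments; the master fragment
and any per-fragment header space are ignored, and the function name is
hypothetical.

```python
FRAGMENT_SIZE = 4096  # fixed fragment length stated in the text, in bytes


def split_into_fragments(payload, fragment_size=FRAGMENT_SIZE):
    """Split a message payload into consecutive fixed-length data fragments.

    Every fragment except possibly the last one is full. Header overhead and
    the master fragment are deliberately left out of this sketch.
    """
    return [payload[i:i + fragment_size]
            for i in range(0, len(payload), fragment_size)]


if __name__ == "__main__":
    fragments = split_into_fragments(b"x" * 6000)
    # Two data fragments, the first of which is full, as in the figure.
    print([len(f) for f in fragments])
```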

Every PVM task has an associated dynamic list called pvmrxlist, which stores
the received messages waiting to be taken by that task. On the other hand, all
the messages sent by a PVM task are stored in a static queue called txlist,
which has a maximum capacity of 100 messages.

The PVM daemon (pvmd) converts fragments into packets and vice versa.
A packet is a fragment with additional control information. The daemon
maintains two queues, locltasks and hosts, for packet delivery to the local
tasks and to other hosts respectively.

2.2 Socket and Protocol Layer

The fragment sent by the PVM layer is decomposed into packets of MTU (Maximum
Transmission Unit) size. A structure called sk_buff is associated
with every packet. This structure is used by Linux for passing data through the
TCP/IP protocol layers jS|. When packets are sent (received), every protocol
adds (extracts) control information to (from) its reserved Head and Tail space
(see Fig. 1 (right)).

When packets are sent to (received from) the logical network device, the
socket layer creates (receives) a new sk_buff and stores it in the write-queue
(receive-queue). Each of these queues has a maximum capacity of 65535 bytes.



2.3 Logical Network Device Layer 

In transmission, the sk_buff structures coming from the protocol layer are
stored in one of three buffering queues (each with a maximum capacity of 100
elements). The choice of queue depends on the priority of the packet:
interactive (highest priority), normal (PVM messages) or background (lowest
priority). The head of every queue is stored in an array called buffs. On the
other hand, the packets received from the physical device are stored in a list
called backlog, which has a maximum length of 300 buffers.
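
As a rough model of the transmission side described above, the following
sketch keeps three bounded queues indexed by priority. It is not kernel code:
the class and method names are hypothetical, and both the policy of rejecting
a packet when its queue is full and the policy of always serving the highest
non-empty priority first are assumptions.

```python
from collections import deque

QUEUE_CAPACITY = 100  # per-queue limit stated in the text
PRIORITIES = ("interactive", "normal", "background")  # highest first


class DeviceQueues:
    """Three bounded transmission queues, one per packet priority."""

    def __init__(self, capacity=QUEUE_CAPACITY):
        self.capacity = capacity
        # Plays the role of the buffs array: one queue per priority level.
        self.buffs = {name: deque() for name in PRIORITIES}

    def enqueue(self, packet, priority="normal"):
        """Store a packet; return False if the chosen queue is already full."""
        queue = self.buffs[priority]
        if len(queue) >= self.capacity:
            return False
        queue.append(packet)
        return True

    def dequeue(self):
        """Return the oldest packet of the highest non-empty priority."""
        for name in PRIORITIES:
            if self.buffs[name]:
                return self.buffs[name].popleft()
        return None
```

A monitor such as the one described in this paper would sample the length of
each of these queues rather than their contents.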



2.4 Physical Network Device Layer (Driver) 

Our communication board is an Intel EtherExpress 10/100 Mbps, which has an
i82558 microprocessor. The i82558 communicates with the kernel by means of
a shared memory mechanism. This memory is divided into two queues of sk_buff
packets (called frames in this layer): CBL, for sending packets to the network,
and RFA, for receiving packets from it. The maximum number of elements
in each queue is 16.



3 Monito: The Monitoring Tool 



Based on the previous section, the most interesting transmission/reception
queues to be analyzed in each layer are hosts, locltasks and txlist/pvmrxlist
in the PVM layer, write-queue/receive-queue in the socket layer, and
buffs/backlog and CBL/RFA in the logical and physical device layers
respectively.

The implemented utilities are: two PVM services, pvm_getpvmdstats
and pvm_getaskstats; the stadsoc, stadque and staddev modules; the dev-queues
system call; and finally Netmon, an application that monitors and collects the
information provided by these utilities.

Table 1. Netmon arguments.

$ Netmon -d sp -t mt [-s -f] [Interface]

-d sp       : sampling period (sp) in milliseconds
-t mt       : monitoring time (mt) in seconds
-s          : output to display
-f          : output to file Netmon.dat
[Interface] : sampling interface, default eth0




Fig. 2. (left) pvmd monitoring; (right) pvm tasks monitoring. 



3.1 Netmon 

The Netmon arguments are the sampling period (sp), the total monitoring time
(mt) and the sample storage file. The format of the Netmon invocation is shown
in Table 1. Netmon performs the following operations in every sampling period:

1. Obtain PVM information

(a) Obtain pvmd (PVM daemon) statistics. This is carried out by the PVM
call pvm_getpvmdstats (see Fig. 2 (left)). The function pvm_getpvmdstats
sends a TM_PVMDSTAT message to the daemon and waits for a response
from it (1). In the daemon, a new function, tm_pvmdstat, was
implemented to reply to Netmon with another TM_PVMDSTAT message
containing the information of the hosts and locltasks structures (2),
for example the packets to deliver to remote hosts (in hosts)
and the packets to deliver to the local tasks (in locltasks).

(b) Obtain PVM task statistics. This begins in the new PVM call
pvm_getaskstats (see Fig. 2 (right)). The function pvm_getaskstats sends a
TM_TASKSTAT message to the daemon and waits for a response from it (1). In
the daemon, a new function for dealing with this kind of message was
implemented, named tm_taskstat. This function sends a TC_TASKINFO
message to all the PVM tasks (2). Next, it waits for the reply
from all the PVM tasks through a new pvm-tc-taskinfo function (3) and
then sends a TM_TASKSTAT message to Netmon (4). The information
obtained is the number and size of the buffered messages waiting to be
sent in the txlist queue (or to be taken from the pvmrxlist queue).

Table 2. stadsoc, stadque and staddev information.

stadsoc                        stadque            staddev
protocol type (tcp, udp, raw)  # of queues        received collision packets
@IP and port Source            max. queue length  pending packets in RFA
@IP and port Target            Interactive queue  delayed transmission packets
sk_buff's in recv_queue        Normal queue       one trans. collisions
sk_buff's in write_queue       Background queue   multiple trans. collisions
total bytes in recv_queue      backlog sk_buff's  pending packets in CBL
total bytes in write_queue                        retransmissions




2. Obtain Linux information

(a) Obtain the socket statistics. Netmon reads the file /proc/net/stadsoc,
created and maintained by the stadsoc module for storing write-queue
and receive-queue information. The stadsoc column in Table 2 shows
the information provided by stadsoc. Note that this information is also
supplied by the kernel in three different files, but the overhead of reading
these can be unacceptable for small sampling periods. This is the reason
for implementing this module.

(b) Obtain logical device statistics. The method used to get information
about the backlog and buffs queues of the logical device is the same as in
the previously explained module, stadsoc. The module is named stadque
and its associated file is /proc/net/stadque (see the stadque column in
Table 2). There is no known utility that gets this kind of information.

3. Obtain network device information. To capture information about the
physical network device (see Table 2, column staddev) that is not supported in
the /proc/net file system, and also to sample its RFA and CBL queues, another
module, staddev, was implemented. Its associated file is /proc/net/staddev.

Note that the PVM data is collected by message passing. This can produce
some overhead in the monitor. When it finishes, Netmon displays the additional
execution time incurred by Netmon, the percentage of samples which overlapped
the sampling period and the maximum extra time required in a sampling period.
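
The two overlap figures reported at the end of a run can be computed from the
time each sample took, as in the following sketch. The function name and the
representation of the measurements as a list of durations are assumptions;
Netmon's own bookkeeping is not described in the text.

```python
def overlap_statistics(sample_durations_ms, sampling_period_ms):
    """Summarize how often collecting a sample exceeded the sampling period.

    Returns a pair (percentage of samples that overlapped the period,
    maximum extra time beyond the period in milliseconds).
    """
    if not sample_durations_ms:
        return 0.0, 0
    # Extra time of every sample that did not fit into its period.
    overruns = [duration - sampling_period_ms
                for duration in sample_durations_ms
                if duration > sampling_period_ms]
    percent = 100.0 * len(overruns) / len(sample_durations_ms)
    return percent, max(overruns, default=0)


if __name__ == "__main__":
    # Four samples measured against a 100 ms period.
    print(overlap_statistics([50, 120, 80, 150], 100))
```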

4 Experimentation 

The trials were performed in a PVM distributed environment: a cluster made up
of a 100 Mbps Fast Ethernet network and four PCs with the same characteristics,
namely a 350 MHz Pentium II processor, 128 MB of RAM, 512 KB of cache, the
Linux OS (kernel v. 2.0.36) and PVM 3.4.0.

The correct behaviour of Monito is checked by means of a synthetic benchmark.
Next, two kernel benchmarks from the NAS suite [3] are run in order to show
an example of Monito's use for evaluating the performance (and finding the
bottlenecks) of the communication system in a PVM-Linux cluster.

Table 3. Sampling period overlap. Fault means sintree crash due to lack of
memory.

N     M    %Overlap in DontRoute    N   M    %Overlap in RouteDirect
32    8KB  0%                       25  8KB  0%
32    4MB  Fault                    25  2MB  48%
750   8KB  4%                       25  4MB  Fault
1000  8KB  Fault                    32  8KB  Fault



4.1 Monito Evaluation 

The benchmark implemented, called sintree, works on a communication pattern
of one to many, and many to one. sintree accepts two arguments: the number of
processes (N), which continuously send messages by multicasting, and the size
of those messages (M) (by default N = 25 and M = 8KB). In the trials the two
PVM operating modes (RouteDirect and DontRoute [4]) were used, together with
the notation processes/size_of_messages. For example, 32/8K means that the
sintree arguments are N = 32 processes and M = 8KB. The default Netmon
arguments were sp = 100 ms and mt = 200 s.

Table 3 shows the percentage of samples in which the sampling period was
overlapped while monitoring the sintree application. This table indicates the
critical values for N and M in each PVM operation mode; note that a more
precise search should be done, but this is beyond the scope of this article.

Fig. 3 shows the main results obtained in the physical and socket layers.
The figure on the left reports the results obtained for the physical layer in
the DontRoute 32/2MB case. The CBL queue is filled due to the great number of
fragmented packets transferred from higher levels. Remember that the maximum
CBL and RFA capacity is 16 packets, but for safety reasons the driver keeps
two CBL elements in reserve. For this reason the maximum number that appears
in Fig. 3 (left) is 14. Fig. 3 (right) shows the socket layer statistics for
the DontRoute 750/8K case. Note that the reception queue is saturated (the
maximum capacity is 65535 bytes). There is no buffer saturation or other
relevant event in the remaining cases in these layers, and thus those results
are not shown.

Fig. 4 reports the most representative values obtained from the PVM layer
queues. Monito gives the size of every queue in bytes or packets, as can be seen
in Fig. 4(a), where the locltasks statistics of the PVM daemon in
bytes (packets) are shown. Note that the size of the pvmd packets is 4032 Bytes
(equal to, for example, 3689280 Bytes from Fig. 4(a) divided by 915 packets).
Fig. 4(b) shows the reception queue (pvmrxlist) in the parent task of the
sintree benchmark. Observe that dividing the maximum pvmrxlist occupancy
reached (= 46137344 Bytes) by the number of packets (= 22) gives 2097152 Bytes
(the size of the messages sent); this also demonstrates the correct behaviour
of Monito.
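
The two consistency checks quoted above are plain divisions and can be
reproduced directly; the variable names are chosen here only for readability.

```python
# Values quoted in the text for the locltasks queue (DontRoute, 32/2MB).
locltasks_bytes = 3689280
locltasks_packets = 915

# Values quoted in the text for the pvmrxlist queue (RouteDirect, 25/2MB).
pvmrxlist_bytes = 46137344
pvmrxlist_entries = 22

# Both divisions are exact, so integer division loses nothing.
packet_size = locltasks_bytes // locltasks_packets
message_size = pvmrxlist_bytes // pvmrxlist_entries

print(packet_size)                       # size of one pvmd packet in bytes
print(message_size)                      # size of one buffered message in bytes
print(message_size == 2 * 1024 * 1024)   # True when the message is 2 MB
```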

Fig. 3. DontRoute (left) buffered packets in transmission (CBL queue) for
32/2MB and (right) pvmd socket buffer in reception (receive-queue) for 750/8K.

Fig. 4. (left) DontRoute locltasks queue in bytes 32p/2MB; (right) RouteDirect
pvmrxlist queue 25p/2MB.



4.2 NAS Benchmarks 

In order to illustrate the use of Monito for evaluating the performance of the
communication system, two kernel benchmarks of the NAS suite, MG and IS, are
run for the class A problem size. The execution times for IS and MG with one
process per node are 156 s and 103 s respectively. The main transmission queues
from the PVM layer to the driver layer are shown in Fig. 5.

In the CBL queue the maximum capacity is hardly ever reached. Also, in the same
queue the number of iterations of each benchmark (10 in IS and 4 in MG) is
visible as the corresponding number of peaks. The intensive communication
required by the IS benchmark shows up above all in the hosts queue, although
saturation is not reached (its maximum capacity is determined by the remaining
memory). However, more accurate research is required in order to determine
exactly at which level the main bottlenecks and overheads of the communication
system in a PVM-Linux cluster are located.

Fig. 5. Main transmission queues from the PVM layer to the device layer.



5 Conclusions 

Monito, a tool for measuring the state of all the message queues in a
PVM-Linux communication environment, has been presented. The analysis goes from
the PVM queues, through the kernel queues, to those of the physical network
device. By executing some benchmarks and comparing the data collected with the
expected results, the correct behaviour of Monito was shown. This tool will
allow in-depth study of communication bottlenecks and their correction.

Future work is directed towards new algorithms to decrease the overhead of
sampling data (in the current implementation, the sampling period is often
overlapped). Another goal is to extend Monito to also evaluate MPI
communication performance.



References 

[1] R. Card, E. Dumas, and F. Mevel. The Linux Kernel Book. Wiley, 1998.

[2] Information Networks Division, HP Co. Netperf: A network performance
benchmark. http://www.netperf.org/netperf/NetperfPage.html, 1996.

[3] Parkbench Committee. Parkbench 2.0. http://www.netlib.org/park-bench, 1996.

[4] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam.
PVM: Parallel Virtual Machine - A User’s Guide and Tutorial for Networked
Parallel Computing. The MIT Press, 1994.



Monito: A Communication Monitoring Tool for a PVM-Linux Environment



[5] M.T. Heath and J.A. Etheridge. Visualizing performance of parallel programs.
IEEE Software, 8(5):29-39, September 1991.

[6] J.A. Kohl and G.A. Geist. XPVM 1.0 user’s guide. Technical Report
ORNL/TM-12981, Computer Science and Mathematics Division, Oak Ridge National
Laboratory, April 1995.

[7] B.P. Miller, J.K. Hollingsworth, and M.D. Callaghan. Environments and Tools for
Parallel Scientific Computing. J.J. Dongarra and B. Tourancheau (eds.), SIAM
Press, 1994.

[8] J. Postel. RFC 791 - Internet Protocol: Protocol specification. September 1981.

[9] F. Solsona, F. Gine, J.L. Lerida, P. Hernandez, and E. Luque. Monito v1.0.
http://www.eup.udl.es/diei, 2000.

[10] J.C. Yan, M. Schmidt, and C. Schulbach. The automated instrumentation and
monitoring systems (AIMS) - version 3.2 user’s guide. NAS Technical Report
NAS-97-001, January 1997.



Interoperability of OCM-Based On-Line Tools 



Marian Bubak^1,2, Wlodzimierz Funika^1, Bartosz Balis^1, and
Roland Wismüller^3

^1 Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Krakow, Poland
{bubak,funika}@uci.agh.edu.pl, balis@icsr.agh.edu.pl
phone: (+48 12) 617 39 64, fax: (+48 12) 633 80 54
^2 Academic Computer Centre - CYFRONET, Nawojki 11, 30-950 Krakow, Poland
^3 LRR-TUM - Technische Universität München, D-80290 München, Germany

wismuell@in.tum.de



Abstract. In the course of parallel application development, the use
of supporting tools for debugging, performance analysis or visualization
is indispensable. Since the services provided by the tools usually
complement one another, it is necessary to enable the tools to cooperate
with each other. This cooperation, often referred to as interoperability,
is feasible by means of the OCM universal monitoring system. This paper
presents some issues of interoperability of two OCM-based tools, the
DETOP debugger and the PATOP performance analyzer. An insight
into the tool environment based on the OCM is also provided.

Keywords: monitoring, on-line tools, interoperability, OMIS. 

1 Introduction 

Tools for parallel programming support are important components of parallel
application development. Each type of tool has a well-defined functionality.
Therefore, in order to achieve a complex set of services in a tool environment,
the interoperability of tools is highly desirable. By interoperability we mean
the capability of tools to run concurrently and be applied to the same
application, with a possible synergetic effect jS]. Ideally, we would like to
enable interoperability between two tools coming from different vendors.
However, such tools are likely to be incompatible with each other, and it might
not even be possible to run them concurrently due to low-level conflicts; or,
even if the tools are able to run concurrently, further conflicts may occur at
higher levels.

On-line tools need a specialized module for observing and possibly
manipulating the state of a parallel program, which is called a monitoring
system. Sometimes this module is integrated with a tool, but it is much more
profitable to have a separate facility to provide information on parallel
application processes and to mediate in controlling the application. One
benefit of this approach is modularity: the tool development is separated from
the monitoring system development. The most important benefit, however, is that
multiple tools are enabled to use a single monitoring system, which not only
reduces the overhead induced by running multiple tools, but also provides the
prerequisites for tool interoperability.

This paper presents how interoperability of on-line tools is enabled in a tool
environment based on the OCM (OMIS-Compliant Monitor) universal monitoring
infrastructure.

2 Interoperability 

The term interoperability, in the context of monitoring, refers to on-line
tools, and means their capability to run concurrently and be applied to the
same application. Moreover, cooperation between tools can provide additional
functionality to the tool environment. For example, if a performance analyzer
runs concurrently with a load balancer, and the latter migrates a process, the
former should visualize the migration on its displays. And vice versa, a process
migration may be forced manually via the performance analyzer.

The first basic requirement for interoperability concerns the possibility to run
different tools concurrently. In the case of tools coming from different
vendors, supplied with their own monitoring modules, structural conflicts
between different portions of the monitors may occur, which may even prevent
the tools from running concurrently. As multiple tools may request an operation
on a single object (e.g. writing into a process address space) at the same
time, an infrastructure must exist to handle the multiple requests. For these
reasons, unless tools form a monolithic, integrated environment in which they
depend on each other's implementation, interoperability of tools based on
distinct monitoring systems is hardly possible, due to likely structural
conflicts or conflicts on exclusive objects among the monitoring modules |^.

Further problems may occur at the user level and manifest in logical conflicts. 
For example, if a debugger and a visualizer work concurrently, and a process is 
stopped by the former, the latter might not show it on its displays unless it 
is notified of the event. This results in inconsistencies in representation of the 
monitored system state, which we call consistency problems. 

The next two sections describe the universal monitoring system OCM and
provide an insight into the interoperability support within the OCM.

3 OCM - A Universal Monitoring System 

3.1 General Structure 

The OCM is an implementation of the OMIS (On-line Monitoring Interface
Specification) [3]. It is a centralized distributed system, composed
of a central module, called NDU (Node Distribution Unit), which is interfaced
to a tool, and a collection of modules, called local monitors, which are
interfaced to the application. The operation of the OCM is thoroughly presented
in |S|.

In accordance with the OMIS specification, the target parallel system is
viewed by the OCM as a hierarchical set of objects. The specification defines
5 types of objects: nodes, processes, threads, message queues and messages,
with a collection of services to operate on objects. The services fall into
three categories: information services to obtain information about an object,
manipulation services to change the state of an object, and event services to
trigger an action list whenever a specified event occurs.

The OCM is currently adapted to support PVM [2] and MPI applications.
In the course of the Tool-Set 0 project development, several tools were
adapted to work on top of the OCM: the DETOP debugger, the PATOP
performance analyzer and the VISTOP visualizer.



3.2 Interoperability Support in the OCM 

The OCM provides some coordination features to address low-level conflicts
that arise when multiple tools access shared objects 0:

— requests referring to a single object are mutually exclusive, 

— requests operating on more than one node are distributed to local monitors 
via an atomic multicast operation, to provide their execution in the same 
order on each node. 

— requests can be locked to prevent any other requests on any node from
execution while the locked request is being executed.

Furthermore, the concept of events, as defined by the OMIS specification,
makes it possible to address higher-level conflicts. These issues are described
in the following sections.

4 OCM-Based Tool Environment 

In this section, we focus on the interoperability of two tools, DETOP and
PATOP. An insight into the structure of an OCM-based tool environment and
some of its components is provided. Also covered is the startup "protocol" of
the environment.



4.1 General Structure of an OCM-Based Tool Environment 

The general structure of an OCM-based tool environment composed of DETOP and
PATOP is shown in Fig. 1. The OCM is a layer between the application and the
tools. In fact, the tools communicate with the OCM indirectly through a
high-level routine library, called ULIBS 0.

OCTET is a newly developed tool that manages the environment.
It is described in the next subsection.

4.2 The OCTET Tool 

The OCTET (OCM-based Tool Environment top-level Tool) tool was created
to work on top of the tool environment. OCTET performs two tasks. The first is
the start-up of the tool environment, which includes spawning the
application processes and running the tools. The second is to provide the tools
with information to resolve consistency conflicts.

OCTET is a console application that provides a simple interface for setting
up a number of parameters, such as the name of the parallel environment (PVM or
MPI), the paths to the application and tool executables, and the number of
processes to be run (in the case of MPI only). A sample session with OCTET is
shown in Fig. 2. The set command is used to set up the environment, including
the parallel library type (PVM or MPI), the path to the application executable
and possibly other parameters. The run command schedules the specified tool to
be run. The tools as well as the application are actually run after the go
command is invoked. Commands which are not recognized by OCTET are considered
explicit requests to the monitoring system; they are sent to the OCM, and the
replies to them are printed to the standard output.
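
The command handling described above amounts to a small dispatcher with a
fall-through case. The sketch below is only an illustration of that idea: the
class, its attributes and the reply strings are hypothetical, and the
send_to_monitor callable stands in for whatever mechanism OCTET uses to reach
the OCM.

```python
class OctetConsole:
    """Toy model of a console that handles set/run/go and forwards the rest."""

    def __init__(self, send_to_monitor):
        self.settings = {}          # parameters collected by "set"
        self.scheduled_tools = []   # tools queued by "run"
        self.started = False        # becomes True after "go"
        self.send_to_monitor = send_to_monitor

    def execute(self, line):
        words = line.split()
        if not words:
            return ""
        command, args = words[0], words[1:]
        if command == "set" and len(args) >= 2:
            self.settings[args[0]] = " ".join(args[1:])
            return f"{args[0]} set to {self.settings[args[0]]}"
        if command == "run" and len(args) == 1:
            # Tools are only scheduled here; nothing starts before "go".
            self.scheduled_tools.append(args[0])
            return f"{args[0].upper()} has been scheduled to run"
        if command == "go" and not args:
            self.started = True
            return "starting the tool environment..."
        # Anything unrecognized is treated as a request to the monitor.
        return self.send_to_monitor(line)
```

For instance, with a stand-in monitor that echoes its input, a line such as
`:mpi_get_proclist()` would be passed through unchanged and its reply returned
to the caller.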



4.3 Startup Mechanism in the Tool Environment 

The startup of the tool environment is managed by OCTET. Time dependencies
at the tool environment startup are shown in Fig. 3.

First, OCTET establishes communication with the monitoring system, which
typically means starting the OCM. Next, based on the information provided
by the user, OCTET orders the OCM to start the application. This process may
vary depending on the parallel environment^1. In the next step, OCTET starts the
tools and provides them with a list of application process tokens. The tools are
supposed to attach to each of the application processes. Finally, OCTET provides
the tools with information on the environment to enable possible interactions
between them (see Subsection 5.2).

octet> set mode MPI
application type set to MPI
octet> set app-path "$HOME/MPI/cpi"
application path set to /home/balis/MPI/cpi
octet> run patop
PATOP has been scheduled to run
octet> run detop
DETOP has been scheduled to run
octet> go
starting the tool environment...
[tools and the application are being started]
octet> :mpi_get_proclist()

Fig. 2. A sample session with OCTET.




^1 In the current implementation, this pattern is actually reversed in the case
of an MPI application: the application itself is started prior to the OCM.






5 Interoperability of DETOP and PATOP 

5.1 Possible Benefits of DETOP and PATOP Cooperation 

Let us consider a long-running parallel application. PATOP, as a performance
analysis tool, can be used to monitor and visualize the application execution.
Suppose the application reveals unexpected behaviour, observed via the
PATOP performance displays. We would certainly like to locate the
application's point of execution to find the cause of the behaviour. This is
what DETOP helps with, as it works at the source code level. After the
application execution has been suspended with DETOP and the relevant
variables have possibly been examined, DETOP is used to resume the application
execution.

5.2 Direct Interactions 

PATOP and DETOP cooperation reveals consistency problems (Section 2). The
incorrect behaviour occurs in two cases:

— When the application is started with PATOP, PATOP starts reading the
performance data from the OCM and updates its performance displays to visualize
the execution. However, when the application is suspended by DETOP, PATOP
continues reading data and updating displays, while the expected behaviour is
that PATOP suspends monitoring while the application is suspended.
This is only possible if PATOP is notified whenever DETOP suspends the
application execution.

— When the application is started with DETOP, PATOP does not start
monitoring. Again, a notification that the application processes have been
continued is necessary. Similarly, when the application is resumed by DETOP
after having been stopped, the notification is also needed.

Fortunately, the notion of events provided by the OMIS specification helps
resolve these problems 0. Basically, PATOP needs to "program" a reaction to
each event of thread suspension or continuation. This can be achieved if PATOP
issues the following two conditional requests to the OCM:

thread_has_been_stopped([]): print([$proc])
thread_has_been_continued([]): print([$proc])

The semantics of these requests is as follows: whenever a process to which
PATOP is attached has been stopped (continued), the process identifier of the
stopped (continued) process is returned. The events are handled by means of a
callback mechanism. The process identifier is passed to the appropriate
callback function, which is invoked on every occurrence of the event. This
callback function performs actions to stop (or resume) the measurements.
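
The callback mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `EventDispatcher`, `Measurement`, `ThreadEvent` and `program_reactions` are hypothetical names and are not part of OMIS, the OCM, or PATOP. The sketch merely shows how one callback per event kind, receiving the process identifier, could suspend and resume measurements.

```cpp
#include <functional>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical event kinds mirroring the two conditional requests.
enum class ThreadEvent { Stopped, Continued };

// Hypothetical stand-in for the measurement state kept by a tool.
class Measurement {
public:
    void pause(int proc)  { paused_.insert(proc); }
    void resume(int proc) { paused_.erase(proc); }
    bool is_paused(int proc) const { return paused_.count(proc) != 0; }
private:
    std::set<int> paused_;  // processes whose measurements are suspended
};

// Hypothetical dispatcher: keeps a list of callbacks per event kind and
// invokes them on every occurrence, passing the process identifier.
class EventDispatcher {
public:
    using Callback = std::function<void(int /*proc*/)>;

    void on(ThreadEvent ev, Callback cb) {
        callbacks_[ev].push_back(std::move(cb));
    }
    void raise(ThreadEvent ev, int proc) const {
        auto it = callbacks_.find(ev);
        if (it == callbacks_.end()) return;
        for (const auto& cb : it->second) cb(proc);
    }
private:
    std::map<ThreadEvent, std::vector<Callback>> callbacks_;
};

// Registers the two reactions: stop measuring when a process is suspended,
// resume measuring when it is continued.
inline void program_reactions(EventDispatcher& d, Measurement& m) {
    d.on(ThreadEvent::Stopped,   [&m](int proc) { m.pause(proc); });
    d.on(ThreadEvent::Continued, [&m](int proc) { m.resume(proc); });
}
```

In this sketch, raising `ThreadEvent::Stopped` for a process marks its measurements as paused, and a later `ThreadEvent::Continued` clears that mark.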

This raises the following questions:

1. PATOP can program reactions to various scenarios of tool cooperation.
However, how can PATOP learn the actual configuration of the tool environment
(which tools are running) so that it can perform the appropriate actions?






Marian Bubak et al. 



2. Where should the above requests be implemented? We might decide to insert
the appropriate code directly into PATOP; however, this would be an intrusion
into the tool implementation, which contradicts the principal idea of
a universal monitoring environment, where tools are independent.

In [0], the second problem is resolved by dynamically inserting and calling the
necessary code in the tool via machine-level monitoring techniques such as
dynamic instrumentation. A drawback of this approach is its complexity and the
resulting poor portability. The approach presented in [0] is currently
implemented only for PVM on Sparc/Solaris. For our environment, we therefore
chose a higher-level approach. For each tool, a specific library is provided
in which every possible scenario of tool cooperation is handled.

One might get the impression that in this approach tools actually know 
about each other, thus it is arguable whether they remain independent of each 
other. However, all the interoperability related code is implemented as a new 
module in ULIBS which might be considered as an independent component of 
the tool environment (Fig.^), although it is implemented as a library being linked 
to the tools’ executables. Thereby the tools themselves are not really affected.
It should be stressed that the new module is designed to provide general
support for the interoperability of any combination of tools, not only DETOP
and PATOP. Although at present only the case of DETOP-PATOP interoperability
is implemented, other scenarios can easily be added.

Note that with the implementation described above, the tools have to be
provided with information on which tools are running in the environment. The
component that de facto knows the whole system, in particular which tools are
running, is the OCTET tool. OCTET can pass this information to all the tools,
which activates the part of the interoperability module that is appropriate to
the given scenario. For example, if OCTET knows that it will run PATOP and
DETOP, it can pass to PATOP the information that DETOP is running. This
information would be processed by the startup module in ULIBS and passed to
the interoperability module, which, as a result, would issue the two requests
described earlier.
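
As a rough sketch of this activation step, the function below maps a tool's own name and the set of tools reported as running to the requests that should be issued. The function name `requests_for` and the use of plain strings are assumptions made for illustration and do not reflect the actual ULIBS interface; only the PATOP-with-DETOP scenario from the text is covered.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the conditional requests a tool should issue, given its own name
// and the set of other tools reported as running (hypothetical interface).
// Only the scenario described in the text is handled; every other
// combination yields no requests.
inline std::vector<std::string> requests_for(
        const std::string& self, const std::set<std::string>& running) {
    std::vector<std::string> requests;
    if (self == "PATOP" && running.count("DETOP") != 0) {
        requests.push_back("thread_has_been_stopped([]): print([$proc])");
        requests.push_back("thread_has_been_continued([]): print([$proc])");
    }
    return requests;
}
```

Further scenarios would be added as additional branches, which keeps the scenario-specific code in one place outside the tools themselves.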

6 Concluding Remarks 

Interoperability of on-line tools for parallel programming support is a key
feature for building a powerful, easy-to-adapt tool environment. With the
interoperability support located in the environment infrastructure, not in the
tools themselves, users can customize their environment by picking the tools
that best fit their needs.

A system that supports interoperability must meet a number of requirements. 
First of all, the tools must be able to run concurrently. Next, a way to enable 
interactions between the tools must be provided. Finally, there must be a control 
mechanism to coordinate access requests to the target system objects. 

The OCM monitoring system provides mechanisms that are sufficient to meet
these requirements. Tools adapted to the OCM can run concurrently and operate
on the same object. Moreover, tool interactions that lead to effective tool
cooperation can be defined without intrusion into the tools’ implementation.

Future work will concentrate on the problem of direct interactions between
tools. The current implementation is just the most basic implementation of
the idea presented in Section 5.2. Further development will focus on extending
the role of OCTET in “programming” the interactions. Currently, OCTET just
provides the basic information, while everything else is left to ULIBS. In the
future, OCTET could even provide general directives on how to “program” the
interactions of tools.
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In this paper, we present a new model for parallel program development called 
Data Driven Graph (DDG). DDG integrates scheduling into parallel program
development for clusters of workstations with PVM/MPI communication. The DDG
API library allows users to write efficient, robust parallel programs with
minimal difficulty. Our experiments demonstrate the new parallel program
model with real applications.



1. Introduction

Advances in hardware and software technologies have led to increased interest in the 
use of large-scale parallel and distributed systems for database, real-time, and other 
large applications. One of the biggest issues in such systems is the development of 
effective techniques for the distribution of tasks of a parallel program on multiple 
processors. The efficiency of execution of the parallel program critically depends on 
the strategies used to schedule and distribute the tasks among processing elements. 

Task allocation can be performed either dynamically during the execution of the 
program or statically at compile time [7]. Static task allocation and scheduling attempt 
to predict the program execution behavior at compilation time and to distribute 
program tasks among the processors accordingly. This approach can eliminate the 
additional overheads of the redistribution process during the execution. On the other 
hand, dynamic task scheduling is based on the distribution of tasks among the 
processors during the execution, with the aim of minimizing communication 
overheads and balancing the load among processors. The approach is especially 
beneficial if the program behavior cannot be determined before the execution. 

Although scheduling has been intensively studied from the beginning of parallel
and distributed processing, its application to real programs is still difficult.
Message-passing libraries like PVM/MPI provide little support for DAG generation,
task migration, etc., which are necessary for scheduling. Therefore developing a
program model that integrates scheduling into message-passing systems is very
important.
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2. Message-Passing Libraries and Scheduling 

Parallel program development can be divided into two steps. In the first step, the
parallel program is divided into a set of interacting sequential sub-problems, often
called tasks, which can run in parallel. In the second step, the tasks are assigned to
processors and scheduled in such a way that the program makes the best use of the
system. The parallel program is often written using message-passing libraries like
PVM/MPI.

2.1 Message-Passing Libraries 

Typical message-passing libraries are Parallel Virtual Machine (PVM) and Message-
Passing Interface (MPI). These libraries allow programmers to write portable and
efficient parallel programs in C or Fortran.

The largest disadvantage of PVM/MPI is that corresponding send() and recv()
routines cannot be matched at compilation time. As a result, almost all
programming errors in communication, from simple errors like wrong addresses or
unmatched data formats to more complex errors like race conditions and
deadlocks, cannot be detected at compilation time. Run-time testing and
debugging is well known as one of the most exhausting and tedious parts of
software development. Furthermore, complex errors like race conditions are not
easy to detect; they may appear only under very specific conditions. Such an
error is very dangerous because it may not appear during testing and then appear
when the users do not expect it.

To put PVM/MPI in perspective, they can be compared with assembly languages in
sequential programming. Both are used to write the most efficient programs.
However, programs in both environments are not structured, most errors cannot
be detected at compilation time, and testing and debugging them is time-
consuming. Finding a higher-level model for message-passing programs, in which
programs are easier to write and which has comparable efficiency, is imperative.

2.2 Scheduling 

Scheduling can be static or dynamic. In static scheduling the behavior of the parallel 
program is predictable before its execution. Therefore static scheduling is often done 
before execution so it does not require run-time overhead. In dynamic scheduling the 
behavior of the parallel program is not known in advance so scheduling has to be 
performed at run-time. 

Almost all static scheduling algorithms are based on a Directed Acyclic Graph
(DAG) [5]. Each node of the DAG represents a task, while edges represent
communication and precedence relationships among tasks.
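
A minimal sketch of such a task graph is given below. It is an illustration only and not one of the scheduling algorithms cited in the text: the type `TaskDag` and its members are hypothetical names, and the ordering routine is the standard Kahn algorithm, which yields one execution order that respects every precedence edge.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A task graph: node i is a task, and succ[i] lists the tasks that must
// wait for task i (communication/precedence edges).
struct TaskDag {
    std::vector<std::vector<std::size_t>> succ;

    explicit TaskDag(std::size_t n) : succ(n) {}

    void add_edge(std::size_t from, std::size_t to) {
        succ[from].push_back(to);
    }

    // Kahn's algorithm: returns the tasks in an order that respects every
    // edge. If the graph contains a cycle, the result holds fewer entries
    // than there are nodes.
    std::vector<std::size_t> topological_order() const {
        std::vector<std::size_t> indegree(succ.size(), 0);
        for (const auto& out : succ)
            for (std::size_t t : out) ++indegree[t];

        std::queue<std::size_t> ready;  // tasks with no unfinished predecessor
        for (std::size_t i = 0; i < succ.size(); ++i)
            if (indegree[i] == 0) ready.push(i);

        std::vector<std::size_t> order;
        while (!ready.empty()) {
            std::size_t t = ready.front();
            ready.pop();
            order.push_back(t);
            for (std::size_t s : succ[t])
                if (--indegree[s] == 0) ready.push(s);
        }
        return order;
    }
};
```

A static scheduler would additionally attach computation and communication costs to nodes and edges; these are omitted here to keep the structure visible.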

The largest problem of static scheduling is how to obtain a DAG from a parallel
program. Some scheduling tools provide a graphical environment for drawing the
DAG [2]. However, drawing the DAG of a large parallel program is exhausting
work. Others provide functional or descriptive languages for generating the
DAG. However, most parallel programs are written in C/C++ or Fortran using
PVM/MPI, and mixing functional languages with them is not welcome.
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Dynamic scheduling algorithms often require moving tasks from a processor to 
another in order to balance the loads of processors [6]. In PVM/MPI, tasks are
running in preemptive mode, so task reallocation requires suspending the migrating
task, saving the current state of the task, transferring the task with its state to
the target processor, restoring the state of the task and resuming its execution on
the target processor. Therefore task migration is a complex and costly process; it
also requires substantial support from operating systems and/or programmers.



3. Data Driven Graph

Data Driven Graph [1] is a new program model that integrates scheduling into
parallel program development. The basic properties of DDG are as follows:

• It is a parallel program model: DDG allows specifying tasks, data dependence,
parallelism, etc.

• It is a parallel program model for High Performance Computing: DDG is a
program model for computation-intensive applications. The additional overhead
of DDG is low enough for developing efficient programs.

• It is a parallel program model with scheduling: DDG supports DAG generation,
static and dynamic task scheduling, and run-time task reallocation. Scheduling
algorithms can be integrated into DDG.

• It is a parallel program model for development: Unlike many program models
that are only for theoretical analysis, DDG is a program model for software
development. The DDG Application Programming Interface (API) provides a
simple way to write robust, efficient programs with minimal difficulty.



3.1 Basic Ideas 

Fig. 1 shows the steps of parallel program development. The DAG is the basis of
many scheduling algorithms; therefore a DAG generator is one of the primary aims
of DDG. The DAG generator requires the data dependences among tasks. To obtain
them, DDG has to know, for each task, which variables the task uses and which
variables it modifies. This could be done by tracing the code of the task, but
that is time consuming. Furthermore, as many C/C++ programmers use pointers to
data in their programs, it is very difficult to determine which data a pointer
refers to by tracing code. Therefore, in DDG each task must declare which
variables it uses and modifies.

Data dependence is also the basis for data synchronization. As discussed in
Section 2.1, data synchronization is a source of potential errors in parallel
programs, and programmers spend a large part of their time synchronizing data and
testing and debugging communication errors. It is therefore desirable that data
synchronization be done automatically so that programmers can concentrate on
coding tasks.

The data-use declaration of a task has to be consistent with its code. However,
the code of tasks changes during parallel program development, and maintaining a
separate data-use declaration is undesirable. In DDG, the input and output data
of a task are referred to in its code as formal parameters, and the real
variables are passed to the code as parameters during task creation. By wrapping
the task creation routine, DDG can determine which data the task uses without a
separate declaration.




Fig. 1. Parallel program development in DDG 

Fig. 1 shows the steps of parallel program development. The steps with a dark
background are done by DDG. Only two steps are left to programmers: program
decomposition and task coding. Program decomposition is the most critical step;
the performance of the parallel program strongly depends on it, so it is left
to programmers. On the basis of their knowledge of the problem being solved and
of the target hardware environment, programmers can choose the best way to
divide the problem into a set of tasks. Coding tasks, of course, cannot be done
automatically. Fig. 1 shows that DDG can do most of the remaining work for
programmers.



3.2 Task and Data Definition 

void code_of_task1(int x)
{ ... }

void code_of_task3(int x, int y)
{ ... }

main()
{ int a, b, c;
  create_task(code_of_task1, wo(a));
  create_task(code_of_task1, wo(b));
  create_task(code_of_task3, ro(a), rw(b));
}

Fig. 2. Task and data creation in DDG. 



Tasks in DDG are created by calling create_task(code, parameter, ...),
where code is a function/procedure/subprogram in a High Level Language (HLL), and
the parameters are variables with an access right (read-only (ro), read-write
(rw), or write-only (wo)). Because the code of a task contains no information
about which real data it uses, several tasks can have the same code but different
variables as parameters. An example of task creation can be found in Fig. 2.
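
How a wrapped task-creation routine can learn the data use of a task from its parameters is sketched below. This is not the DDG implementation: `Param`, `TaskRecord` and `TaskRegistry` are hypothetical names, the task code is identified by a string rather than a function, and the sketch only records the declarations without executing anything.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Access { ReadOnly, ReadWrite, WriteOnly };

// One parameter of a task: which real variable is passed, and how the
// task is allowed to access it.
struct Param {
    const void* var;  // the address identifies the real variable
    Access mode;
};

// Access-right wrappers in the spirit of ro(), rw() and wo() from Fig. 2.
template <typename T> Param ro(const T& v) { return Param{&v, Access::ReadOnly}; }
template <typename T> Param rw(T& v)       { return Param{&v, Access::ReadWrite}; }
template <typename T> Param wo(T& v)       { return Param{&v, Access::WriteOnly}; }

struct TaskRecord {
    std::string code;           // illustrative name of the task code
    std::vector<Param> params;  // data accesses declared at creation
};

// Hypothetical registry: because every parameter passes through a wrapper,
// the creation routine sees the complete data use of each task.
class TaskRegistry {
public:
    template <typename... Ps>
    std::size_t create_task(const std::string& code, Ps... params) {
        tasks_.push_back(TaskRecord{code, std::vector<Param>{params...}});
        return tasks_.size() - 1;  // index of the new task
    }
    const TaskRecord& task(std::size_t id) const { return tasks_.at(id); }
private:
    std::vector<TaskRecord> tasks_;
};
```

With this arrangement the same code name can be registered several times with different variables, which matches the observation that several tasks may share one code.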

Tasks in DDG that are assigned to the same processor can share variables. In
order to remove anti-dependence and output dependence, variables in DDG may have
multiple values: each task that writes to a variable creates a new copy (version)
of the variable. Versions that are not used by any unfinished task are
automatically removed from memory. The number of versions of a variable depends
on the number of threads of tasks. Fig. 3 shows an example of multi-version
variables. For simplicity, the tasks in Fig. 3 contain only one command line each
and task creation is not shown. If the tasks are executed in the order they are
created, only one version of variable a exists at any moment. If task7 and task8
are executed in parallel with task1 and task2, version 3 and version 1 exist
simultaneously in memory. If multi-version variables were not used, task7 and
task8 would have to be executed sequentially after all other tasks. DDG
maintains internal data dependence structures similar to the scheme in Fig. 3
and always provides the correct versions to tasks.

1. a = 10;
2. b = a + 5;
3. c = a * 3;
4. a = a + 5;
5. d = a / 2;
6. e = a - 1;
7. a = 7;
8. f = a + 2;




Fig. 3. Multiple-version variable in DDG 



3.3 DAG Generation 

DDG can generate the DAG of a parallel program directly from the structures in
Fig. 3. It is easy to see that a DAG can be generated by connecting the arrows
that go to and from the same version in Fig. 3. The DAG contains only true data
dependence; anti-dependence and output dependence are removed by using
multi-version variables.
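
The idea can be sketched for the eight statements of Fig. 3 as follows. This is a simplified illustration, not the internal DDG data structure: `TaskDecl`, `true_dependences` and `fig3_tasks` are hypothetical names, variables are identified by name, and each write is assumed to create a new version whose producer is the writing task.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Declared data use of one task (hypothetical representation).
struct TaskDecl {
    int id;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

using Edge = std::pair<int, int>;  // (producer task, consumer task)

// Builds the true-dependence edges. A read depends on the task that
// produced the current version; a write starts a new version, so later
// writers never have to wait for earlier readers or writers.
inline std::set<Edge> true_dependences(const std::vector<TaskDecl>& tasks) {
    std::map<std::string, int> producer;  // variable -> writer of latest version
    std::set<Edge> edges;
    for (const TaskDecl& t : tasks) {
        for (const std::string& v : t.reads) {  // reads see the old version
            auto it = producer.find(v);
            if (it != producer.end()) edges.insert(Edge(it->second, t.id));
        }
        for (const std::string& v : t.writes)   // writes create a new version
            producer[v] = t.id;
    }
    return edges;
}

// The eight one-line tasks of Fig. 3, in creation order.
inline std::vector<TaskDecl> fig3_tasks() {
    return {
        {1, {},    {"a"}}, {2, {"a"}, {"b"}}, {3, {"a"}, {"c"}},
        {4, {"a"}, {"a"}}, {5, {"a"}, {"d"}}, {6, {"a"}, {"e"}},
        {7, {},    {"a"}}, {8, {"a"}, {"f"}},
    };
}
```

For Fig. 3 this yields the edges 1→2, 1→3, 1→4, 4→5, 4→6 and 7→8. Task 7 has no predecessor, which is why task 7 and task 8 may run in parallel with the earlier tasks.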



3.4 DDG Communication Module 

The DDG communication module is very simple. It contains only two variables:
ddg_proc_num, which gives the number of processors, and ddg_my_proc, an
integer from 0 to ddg_proc_num-1 giving the identification number of the current
processor. Four functions are implemented in the communication module:
ddg_init_comm(), which initializes communication and sets the values of
ddg_proc_num and ddg_my_proc; ddg_send() and ddg_recv(), which send and
receive data in DDG buffers; and ddg_finish_com(), which is called when the
program finishes. DDG communication can be based on the PVM or MPI library. An
example of a DDG communication module based on PVM is shown in Fig. 4. As the
communication module of DDG is very simple, porting it to MPI or other
communication libraries takes only a few minutes.

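int ddg_proc_num;
int ddg_my_proc;

void ddg_init_comm()
{ int narch;                  /* number of data formats (not used further) */
  struct pvmhostinfo *hi;     /* host table filled in by PVM */
  pvm_config(&ddg_proc_num, &narch, &hi);
  ddg_my_proc = pvm_joingroup(ddg_group);
}

void ddg_send(int dst, ddg_buffer &buffer)
{ pvm_initsend(PvmDataInPlace);
  pvm_pkbyte(buffer.data, buffer.size(), 1);
  /* pvm_gettid() takes the group name and the instance number */
  pvm_send(pvm_gettid(ddg_group, dst), 1);
}

void ddg_finish_com()
{ pvm_exit();
}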



Fig. 4. DDG communication in PVM 



4. Case Study

To demonstrate the capability of DDG, we applied DDG to the Gaussian elimination
algorithm, which has static behavior, nested loops and data parallelism. The
study not only shows the performance of DDG, but also introduces the DDG
Application Programming Interface (API), because a detailed description of the
DDG API cannot be included in this article. All experiments were performed on a
PC cluster of 6 Pentium 500 machines connected by 100 Mb Ethernet.
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Fig. 5. Sequential Gaussian elimination algorithm. 
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The sequential Gaussian elimination algorithm (GEM) is described in Fig. 5. We 
concentrate only on GEM; the input and output functions (init() and print()) are
not considered. The tasks are defined from the lines inside the two outer loops
(lines 7, 8 and 9 in Fig. 5). Before a task is defined, the code of the task has
to be moved into a function (Fig. 6). Finally, the task is created from the code
(Fig. 7). In order to use DDG multi-version variables, ddg_var<T>, where T is a
standard or user-defined type in C/C++, is used instead of T. For a simple type
T, ddg_var<T> can be automatically converted to T; otherwise ddg_var<T>.get()
has to be called explicitly. The access rights of variables are defined by the
functions ddg_ro() (read-only), ddg_rw() (read-write) and ddg_wo() (write-only).
All DDG API function and variable names have the prefix ddg_. It is easy to see
that the programs in Fig. 6 and Fig. 7 have the same structure. The DDG API
uses the PVM library for communication.
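
For orientation, a sequential forward elimination of the kind discussed here can be sketched as follows. This is not a reproduction of Fig. 5: the names `Matrix`, `eliminate_row` and `gaussian_elimination` are chosen for illustration, no pivoting is performed, and nonzero diagonal elements are assumed.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Row update corresponding to one task: subtracts a multiple of row i from
// row j so that the element in column i of row j becomes zero.
// Assumes i_line[i] (the pivot) is nonzero.
inline void eliminate_row(std::size_t i, const std::vector<double>& i_line,
                          std::vector<double>& j_line) {
    const double coef = j_line[i] / i_line[i];
    // Starting at k = i also zeroes the eliminated element explicitly;
    // starting at k = i + 1 would leave it untouched.
    for (std::size_t k = i; k < j_line.size(); ++k)
        j_line[k] -= coef * i_line[k];
}

// Two outer loops of the elimination: every pair (i, j) with j > i is one
// row update, and each such update is a candidate task.
inline void gaussian_elimination(Matrix& a) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            eliminate_row(i, a[i], a[j]);
}
```

The row update reads row i and modifies row j, which is the access pattern that the read-only and read-write declarations express when the update is turned into a task.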



#define N 1200

main()
{ float a[N][N];
  init(a);
  for (int i = 0; i < N-1; i++)
    for (int j = i+1; j < N; j++)      /* the inner loop advances j */
      compute(i, a[i], a[j]);
  print(a);
}

void compute(int i, float i_line[N], float j_line[N])
{ float coef = j_line[i] / i_line[i];  /* multiplier a[j][i] / a[i][i] */
  for (int k = i+1; k < N; k++)
    j_line[k] = j_line[k] - coef*i_line[k];
}



Fig. 6. Intermediate step of DDG task definition 
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#include "ddg.h" 

#define N 1200

typedef float vector[N];

void ddg_main()
{ ddg_var_array<vector> arr(N);
  init(arr);
  for (int i = 0; i < N - 1; i++)
    for (int j = i+1; j < N; j++)
      ddg_create_task(compute, ddg_direct(i),
                      ddg_ro(arr[i]), ddg_rw(arr[j]));
  print(arr);
}

void compute(ddg_var<int> line, ddg_var<vector> i_line,
             ddg_var<vector> j_line)
{ float coef = j_line.get()[line] / i_line.get()[line];
  for (int k = line + 1; k < N; k++)
    j_line.get()[k] = j_line.get()[k] -
                      i_line.get()[k]*coef;
}



Fig. 7. DDG version of GEM 
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Execution times (in milliseconds) of DDG and equivalent PVM version of GEM 
are shown in Table 1. We can calculate the computational overhead of the DDG
program by executing it on a single processor. The DDG overhead is (12874 -
12789)/12874 = 0.0066, which means that it is smaller than 1%. The speedup of
the DDG version on 6 processors is about 3.6. Compared with the PVM version, the
DDG version is less than 1% slower, but its source code is much shorter and
easier to understand. It is very similar to the source code of the sequential
program, so porting existing sequential programs to DDG requires minimal effort.
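
The arithmetic behind these figures can be checked with the small helper functions below. The function names are chosen for illustration only. The text does not state which single-processor time is used as the baseline for the speedup; with the values of Table 1, both the sequential time and the single-processor DDG time give a value of about 3.6.

```cpp
// Relative overhead of running the parallel-model program on one processor,
// measured against the sequential program (both times in the same unit).
constexpr double relative_overhead(double t_model_one_proc, double t_sequential) {
    return (t_model_one_proc - t_sequential) / t_model_one_proc;
}

// Speedup of a parallel run relative to a chosen reference time.
constexpr double speedup(double t_reference, double t_parallel) {
    return t_reference / t_parallel;
}

// Values from Table 1 (milliseconds).
constexpr double kDdgOneProc  = 12874.0;
constexpr double kSequential  = 12789.0;
constexpr double kDdgSixProcs = 3576.0;
constexpr double kPvmSixProcs = 3542.0;
```

For example, `relative_overhead(kDdgOneProc, kSequential)` evaluates to 85/12874, and `speedup(kDdgOneProc, kDdgSixProcs)` to 12874/3576.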

Table 1. DDG performance for GEM 



Processors   DDG version   PVM version   Sequential version
    1           12874         12804            12789
    6            3576          3542              -





5. Conclusion 

Data Driven Graph, a new model for parallel program development, provides a new
approach to parallel programming in message-passing systems with integrated
scheduling. The DDG API allows programmers to write robust, efficient parallel
programs in DDG with minimal difficulty. Experiments with DDG on PC clusters
with PVM communication showed that programs in DDG are simple and efficient,
with minimal overhead.
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Abstract. To provide high-level graphical support for developing 
message passing programs, an integrated programming environment 
(GRADE) is being developed. GRADE provides tools to construct, execute,
debug, monitor and visualise message-passing based parallel programs.
GRADE provides a general graphical interface that hides the low-level
details of the underlying message-passing system; thus, it allows the user
to concentrate on the really important aspects of parallel program
development, such as task decomposition.

The current paper describes the translation mechanism that is applied 
in GRADE to generate the executable message-passing code from the 
high-level graphical description of the user application. 



1 Introduction 

The message-passing paradigm for implementing applications on distributed 
systems (including network of workstations and massively parallel computers) 
closely corresponds to the way in which data are actually moved around in a dis- 
tributed memory computer. Thus, message-passing libraries can be implemented 
very efficiently in such systems. Moreover, with the advent of PVM and MPI 
the portability level of such applications has been raised significantly. Neverthe- 
less, the lack of real user-friendly support for development of such applications 
prevents most of the potential users from dealing with concurrent programming 
at all. 

* This work was partially funded by the Mexican-Hungarian Intergovernmental S&T
project MEX-1/98 and by the Hungarian Science Research Fund (OTKA), Contract
No. T032226.

To cope with the extra complexity of parallel programs arising from
inter-process communication and synchronization, we have designed a visual
programming environment called GRADE (Graphical Application Development
Environment). Its major goal is to provide an easy-to-use, integrated set of
programming tools for the development of message-passing applications that can
run either on a real parallel computer or on a heterogeneous cluster of
workstations. The most important features of GRADE can be summarized as follows:

— All process management and inter-process communication activities are
defined visually in the user’s application. Graphics help even users not
familiar with parallel programming to better understand the complex
structure and run-time behaviour of the distributed program.

— Low-level details of the underlying message-passing system are hidden.
GRADE generates all message-passing library calls automatically on the
basis of the visual code. This approach has two basic advantages: the
programmer is not required to know the syntax of the message-passing (MP)
library, and the same user application is able to run in different MP
environments provided that GRADE can generate the code for those
environments. Currently, GRADE can generate code for PVM and MPI.

— Local computations of the individual processes can be defined in C (or
in FORTRAN in the future) independently of the visually supported
message-passing related activities. Thus, GRADE provides a comfortable
environment for parallelizing existing sequential applications.

— Compilation and distribution of the executables of the user’s processes
are performed automatically in the heterogeneous environment.

— A distributed debugging 0 and an execution visualisation tool 0 are pro- 
vided that are fully integrated into the common graphical user interface of 
the system. Debugging and monitoring information is related directly back 
to the user’s graphical code during on-line debugging and visualisation. 

Graphical notation used in GRADE is called GRAPNEL (GRAphical Process 
NEt Language) 0. The current paper explains how the high-level GRAPNEL 
applications are translated into pure text code by the system. 

The rest of the paper is organised as follows. The various layers of GRAPNEL
applications are described in the next section, followed by some words about
the persistent (i.e. text) representation of the graphical code. The actual
translation of GRAPNEL code into C files is explained in Sect. 4. Finally, the
paper ends with some conclusions.



2 Layers of GRAPNEL Programs 

GRAPNEL programs can be represented at several layers. In the current section 
we summarize the role of the various layers and the transformation mechanisms 
between the layers. 



2.1 GRAPNEL Layer 

GRAPNEL provides the top layer of the GRADE system where the user can con- 
struct his/her parallel program by a graphical editor called GRED. At this layer 
the program is represented graphically as described in 0 in detail. The basic
idea behind this graphical representation is the following. Two hierarchical
levels of the graphical code are distinguished: the application level and the
process level. At the application level, the communication graph of the whole
application is defined graphically, where processes are represented as nodes
connected by communication channels. At the process level, communication
operations (i.e. send and receive actions) are defined by visual means for each
process. In fact, the top-level control flow of each process is defined as a
graph in which every message transfer operation appears as a node.

For illustration purposes, Fig. 1 depicts a sample Application and Process
window of GRADE. They are explained in detail in Sect. 4.1.

This representation is easy to understand for the program developer, but it
is difficult to interpret for programs like parsers. Because of this difficulty,
the GRED editor saves the graphical program in a plain text file, called a GRP
file, which is used by the programs and utilities of the GRADE system. GRED is
also able to read back GRP files and restore the graphical representation on the
screen. The GRP file is an internal form of the GRAPNEL program containing
information on both the graphical and textual parts. A brief description of the
GRP file is given in Sect. 3.



2.2 C-Source Layer 

The GRP file is translated into C source by the GRAPNEL pre-compiler called
GRP2C. The goal of this translation is that all the graphical information which
represents C code should be replaced with the equivalent C source code. However,
graphical information that is relevant only for drawing the GRAPNEL graphs on
the screen without representing any C code (for example the X-Y coordinates of
graphical nodes) is omitted during this translation. Note that while the GRP
file is completely equivalent to the original GRAPNEL code, the C source
generated by the GRP2C pre-compiler is not.



2.3 GRAPNEL API Layer 

Because the communication layer upon which GRAPNEL programs run can be
implemented by different kinds of message passing systems, another software
layer is required which hides the dependencies on the communication layer. This
layer is an Application Programming Interface and, because its physical
representation is a C library, it is called the GRAPNEL Library. This API layer
can support any kind of message passing system, e.g. PVM, MPI, or an operating
system directly. The API consists of GRAPNEL (or shortly: GRP) functions; the
higher layers of the GRADE system, and particularly the generated C sources, use
these GRP functions to start processes and send messages.

GRAPNEL API is the lowest layer which is really included into the GRAP- 
NEL system and is developed by the GRADE team. An API for PVM and MPI 
is already available and support for other systems such as QNX p] operating 
system is under development. 
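
The role of such an intermediate layer can be sketched as follows. The sketch does not show the actual GRAPNEL Library, whose functions are not listed in this paper: `MessageLayer`, `LoopbackLayer` and `report_result` are hypothetical names, and the in-memory backend merely stands in for a backend that would wrap PVM or MPI calls.

```cpp
#include <deque>
#include <map>
#include <string>

// Hypothetical interface of the intermediate layer: code written against it
// does not depend on the underlying message passing system.
class MessageLayer {
public:
    virtual ~MessageLayer() = default;
    virtual void send(int dest, const std::string& payload) = 0;
    // Returns true and fills 'payload' if a message for 'self' was pending.
    virtual bool receive(int self, std::string& payload) = 0;
};

// Illustrative in-memory backend with one message queue per destination.
class LoopbackLayer : public MessageLayer {
public:
    void send(int dest, const std::string& payload) override {
        queues_[dest].push_back(payload);
    }
    bool receive(int self, std::string& payload) override {
        auto it = queues_.find(self);
        if (it == queues_.end() || it->second.empty()) return false;
        payload = it->second.front();
        it->second.pop_front();
        return true;
    }
private:
    std::map<int, std::deque<std::string>> queues_;
};

// Stand-in for generated code: it uses only the interface, so the backend
// can be exchanged without regenerating this function.
inline void report_result(MessageLayer& layer, int master,
                          const std::string& result) {
    layer.send(master, result);
}
```

Supporting a further message passing system then amounts to adding one more class that implements the same interface.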
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2.4 Message Passing Layer 

This layer should be a widely used communication system. PVM or MPI is a 
good choice because they are ported to many operating systems. Because this 
layer hides operating system dependencies the GRADE system can be hardware 
and operating system independent. 



3 Persistent Representation of GRAPNEL Applications: 
GRP Files 

GRAPNEL applications are represented on the screen by a mixture of graphical
icons and textual code segments. In order to store such applications on disk or
to produce executable code from them, they are saved into so-called GRP files.
GRP files are plain text files that contain all necessary information about
GRAPNEL programs. The exact syntax of a GRP file is defined in BNF form
(which serves as input data for the UNIX yacc tool used to generate the parser
of such files).

GRP files have a human-readable format. Information is stored in them in a
well-structured hierarchical way. The top level structure is the “Application”
that consists of two main parts: “HeaderPart” and “ProgramPart”. They are used
for storing information related to the whole application and to the individual
processes, respectively. The “ProgramPart”, in fact, is a list of “Process”
sections describing each individual process of the application separately.

The next subsections give a brief summary about how GRAPNEL applica- 
tions are stored in GRP files and what are the real contents of those files. 



3.1 Separation of Information with GRP Files 

GRP files are interpreted by a parser that is integrated into all components of
the GRADE environment that need to extract information from them (e.g. the GRED
editor and the GRP2C code generator). This parser enables a GRAPNEL application
to be split into several GRP files. Thus, different parts of the same
application can be stored in separate files. According to the two distinct
levels of the graphical code, the information that must be saved into GRP files
can be divided into two main groups. The first one concerns the global view of
the application, including the application level GRAPNEL code, while the second
one deals with local information about individual processes.

In order to support easy re-use of processes across different applications,
the GRAPNEL code of each process is saved into an individual GRP file,
separately from the application level information. In these GRP files, the
“HeaderPart” section is left empty and the process is described as the only
element of the process list in the “ProgramPart” section. On the other hand,
all application level information is stored in the “HeaderPart” section of a
separate GRP file in which the “ProgramPart” section contains no data.
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As a result, the user can open individual process files belonging to other 
GRAPNEL application to insert those processes into the program being devel- 
oped. Furthermore, it is also possible to save the code of any process individually, 
e.g. to store it in a “process warehouse” directory for later use in other applica- 
tions. 
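The separation described above can be illustrated with a minimal Python sketch.
It is not the GRADE parser: the class names, field names and the merge function
are hypothetical, and only the split between an application-level file (filled
“HeaderPart”, empty “ProgramPart”) and per-process files (empty “HeaderPart”,
one process) is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Process:
    """One entry of the "ProgramPart" process list (fields are illustrative)."""
    name: str
    blocks: List[str] = field(default_factory=list)


@dataclass
class GrpFile:
    """Hypothetical in-memory form of one parsed GRP file."""
    header: Dict[str, str] = field(default_factory=dict)    # "HeaderPart"
    processes: List[Process] = field(default_factory=list)  # "ProgramPart"


def merge(files: List[GrpFile]) -> GrpFile:
    """Combine an application-level file and per-process files into one view."""
    app = GrpFile()
    for grp in files:
        app.header.update(grp.header)
        app.processes.extend(grp.processes)
    return app


# Application-level file: "HeaderPart" filled in, "ProgramPart" empty.
app_file = GrpFile(header={"application": "demo"})
# Process files: empty "HeaderPart", exactly one process each.
master_file = GrpFile(processes=[Process("master", ["loop begin", "IA1"])])
slave_file = GrpFile(processes=[Process("slave1", ["SEQ1"])])

application = merge([app_file, master_file, slave_file])
print([p.name for p in application.processes])
```

Because a process file carries no application-level data, the same process file
can be passed to merge for any number of applications, which is the re-use
scenario of the “process warehouse”.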

4 Translating the Graphical Code into C Source Files 

After defining the structure of the GRP files, we show how the GRP2C
pre-compiler generates standard C source code from the GRP files. The programs
generated by the pre-compiler can be compiled with any standard C compiler
and can be executed in the usual way¹.

In order to explain the translation mechanism of GRP2C we show the graphical
representation (application layer) of a simple example in Fig. 1. There are
two processes (“slave1” and “slave2”) computing subtasks and sending the results
to the third process (“master”), which collects the results and sends new
subtasks.

Fig. 1. Simple GRAPNEL Application

¹ Compilation and distribution of executables are carried out automatically by
GRADE, even in the case of a heterogeneous distributed execution environment.

4.1 C Files Generated by the Pre-compiler



The GRP2C pre-compiler generates one C source file for every process, based on
a general template and on the GRP description of the particular process. Every
such source file starts with an include section, whose content depends on the
information located in the HeaderSection of the GRP file. The next part contains
the definitions of the global variables. This part and the beginning of the
main function, where local variables and the channels are defined, are inserted
into the template by the pre-compiler. After the variable definitions, the
pre-compiler inserts C instructions into the template.

Several GRAPNEL Library calls will be inserted in front of the code defined 
by the user. These system calls register the start of the process and initialize the 
channels used by the process. 

The functionality of a process is defined by the programmer and is represented
by the different nodes of the process graph. These nodes are called graphical
blocks. Every block represents a small piece of the executable code, and the
connections between the blocks define the order of execution. The blocks must
be translated into C code in the appropriate order. There are several types of
graphical blocks, and some of them must be processed recursively.

As an example, consider the master process of Fig. 1. Its user-defined graph
structure is depicted in the Process Window in the figure. It consists of a
loop, an alternative input operation (IA1) inside the loop, and a conditional
execution of two sequential blocks (SEQ2 and SEQ3) followed by two output
operations (O1 and O2). The code generated by GRP2C for the various graphical
symbols is explained below.



The C code of loop_start and loop_end must be attached by the programmer to
the “loop begin” and “loop end” blocks, respectively. Thus, GRP2C simply
substitutes loop_start and loop_end in the loop pseudo code with the appropriate
user-supplied C code segments. The code generator keeps track of nested loops
using an internal stack. The code of the blocks between “loop begin” and
“loop end” is placed in a compound statement after the code of loop_start.
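The loop handling just described can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the GRP2C implementation:
the block representation as (kind, code) pairs, the function name and the
indentation policy are hypothetical. What it takes from the text is the
substitution of user-supplied loop_start and loop_end code, the compound
statement around the loop body, and the stack used for nesting.

```python
def translate_loops(blocks):
    """Emit C text for a flat list of (kind, user_code) blocks.

    kind is "loop_begin", "loop_end" or "seq" (hypothetical names).
    """
    lines = []
    stack = []  # one entry per loop that is currently open
    for kind, code in blocks:
        if kind == "loop_begin":
            # User-supplied loop_start code, followed by a compound statement.
            lines.append("    " * len(stack) + code + " {")
            stack.append(code)
        elif kind == "loop_end":
            if not stack:
                raise ValueError("loop end without matching loop begin")
            stack.pop()
            # Close the compound statement, then append the loop_end code.
            lines.append(("    " * len(stack) + "} " + code).rstrip())
        else:
            lines.append("    " * len(stack) + code)
    if stack:
        raise ValueError("loop begin without matching loop end")
    return "\n".join(lines)


example = [
    ("loop_begin", "while (running)"),
    ("seq", "x = next();"),
    ("loop_begin", "do"),
    ("seq", "x--;"),
    ("loop_end", "while (x > 0);"),
    ("loop_end", ""),
]
generated = translate_loops(example)
print(generated)
```

The stack depth doubles as the indentation level here, which is a presentation
choice of the sketch and not something the paper specifies.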






Communication operations such as input, output, and alternative input blocks
require several GRAPNEL API function calls to be included in the generated
file. The GRAPNEL API provides three different communication operations,
independently of the underlying low-level message-passing layer: SEND, RECEIVE
and ALTERNATIVE RECEIVE, each in both synchronised and non-synchronised forms.

The GRP2C pre-compiler places different functions in the generated programs
in order to compose messages, and to send and receive them. Composing a
message means picking up all the required data and packing them together into
a message. To produce proper API calls, the pre-compiler must know the type
and name of the variables that the programmer would like to use as source or
destination of the data to be transferred. The user must therefore attach the
type of the messages sent through a channel to the appropriate port icon, and
the names of the variables taking part in the communication operation to the
appropriate communication block. The user connects the communication blocks in
the graphical code to one or more of the available ports; thus, the
pre-compiler can put all the required information together.

[SEND] For a send operation the pre-compiler produces one or more pack API
function calls to pack variables into the message. The next function call
generated by the pre-compiler is the send function, which sends the prepared
message to the addressed process. This function call accepts a parameter that
specifies whether or not the process should block on this send operation.

[RECEIVE] For a receive operation GRP2C produces one function call that
receives the message and one or more unpack API function calls that pick up
data fragments from the message and place them in variables.

[ALTERNATIVE RECEIVE] The alternative receive operation is similar to the
simple receive, but it accepts a message selectively via more than one port.
The pre-compiler generates different unpack API function calls for the
different ports; they are placed in a switch instruction, and the right one is
selected based on the number of the port on which the actual message arrived.
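The shape of the code generated for the three communication operations can be
sketched as follows. The GRAPNEL API function names are not given in the text,
so every C identifier emitted below (grp_pack_*, grp_unpack_*, grp_send,
grp_receive, grp_alt_receive) is a hypothetical placeholder, as are the Python
function names. Only the structure is taken from the description: pack calls
followed by a send with a blocking flag, a receive followed by unpack calls,
and a switch over the port number for the alternative receive.

```python
def gen_send(port, variables, blocking):
    """Pack every (ctype, name) variable, then send; all C names are made up."""
    lines = [f"grp_pack_{ctype}(&{name});" for ctype, name in variables]
    lines.append(f"grp_send({port}, {1 if blocking else 0});")
    return lines


def gen_receive(port, variables):
    """Receive one message, then unpack its fragments into the variables."""
    lines = [f"grp_receive({port});"]
    lines += [f"grp_unpack_{ctype}(&{name});" for ctype, name in variables]
    return lines


def gen_alt_receive(ports):
    """ports maps a port name to its variable list; one switch case per port."""
    lines = [f"switch (grp_alt_receive({', '.join(ports)})) {{"]
    for port, variables in ports.items():
        lines.append(f"case {port}:")
        lines += [f"    grp_unpack_{ctype}(&{name});" for ctype, name in variables]
        lines.append("    break;")
    lines.append("}")
    return lines


send_code = gen_send("P_OUT", [("int", "result")], blocking=True)
alt_code = gen_alt_receive({"P1": [("int", "a")], "P2": [("double", "b")]})
print("\n".join(send_code + alt_code))
```

The type attached to the port icon and the variable names attached to the
communication block correspond to the (ctype, name) pairs in this sketch.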

> 

A conditional block represents a conditional (i.e. if()) statement. Either the
TRUE or the FALSE branch can be empty. The C code of the if() statement must be
attached to the “cond begin” block by the programmer, and the pre-compiler
includes it in the generated file. The pre-compiler then generates each
non-empty branch as a compound statement. Nested conditionals are handled
using the internal stack mentioned earlier for the loop construct.
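A minimal sketch of the conditional translation is given below. The function
name and the treatment of an empty TRUE branch are assumptions: the text only
says that non-empty branches become compound statements, so the sketch emits an
empty compound statement for an empty TRUE branch in order to keep the else
attached to the right if().

```python
def gen_conditional(if_code, true_lines, false_lines):
    """Wrap already generated branch code around the user's if() code."""
    out = [if_code + " {"]
    out += ["    " + line for line in true_lines]
    if false_lines:
        # Only a non-empty FALSE branch produces an else part.
        out.append("} else {")
        out += ["    " + line for line in false_lines]
    out.append("}")
    return out


only_true = gen_conditional("if (n > 0)", ["work(n);"], [])
only_false = gen_conditional("if (n > 0)", [], ["idle();"])
print("\n".join(only_true + only_false))
```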

Translation of sequential blocks is simple. The pre-compiler simply 
includes the source code attached to the block SEQ by the programmer into 
the generated file. Any source code except communication can be placed in a 
sequential block. Size of the attached code is unlimited and it can contain calls 
to any existing library written even in languages different from C (for example 
FORTRAN). The syntax of the GRP file enables the graphical editor to store 
large source code fragments in a separate file and to mention only the file name 
in the GRP file. 

There is a special kind of block, called a graph block, which is not shown
in our simple example. It can be used in more complex programs to simplify
the graphical representation of the process control flow. A graph block
represents a sub-graph, i.e., any subpart of the process graph can be packed
into and hidden by a graph block. In the graphical editor it can be opened and
edited by the user. When the pre-compiler finds a graph block, it simply starts
generating code for the list of blocks represented by the graph block.
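The recursive treatment of graph blocks amounts to expanding each graph block
in place. The sketch below assumes a hypothetical dictionary representation of
blocks with a "children" list for graph blocks; neither the representation nor
the function name comes from GRP2C.

```python
def expand(blocks):
    """Yield the blocks in generation order, opening graph blocks in place."""
    for block in blocks:
        if block["kind"] == "graph":
            # A graph block only stands for its sub-graph, so recurse into it.
            yield from expand(block["children"])
        else:
            yield block


process_graph = [
    {"kind": "seq", "name": "SEQ1"},
    {"kind": "graph", "name": "G1", "children": [
        {"kind": "input", "name": "I1"},
        {"kind": "graph", "name": "G2", "children": [
            {"kind": "seq", "name": "SEQ2"},
        ]},
    ]},
    {"kind": "output", "name": "O1"},
]
order = [block["name"] for block in expand(process_graph)]
print(order)
```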

5 Conclusions 

The availability of powerful programming environments for heterogeneous
networks is becoming more and more important. GRADE provides an integrated
programming environment where the programmer can concentrate on high-level
abstractions without worrying about the low-level details of communication
primitives. Through its graphical user interface, GRADE provides efficient
support for the most important and time-consuming phases of parallel program
development: rapid prototyping and correctness/performance debugging.

Compared with other visual programming environments developed so far (e.g.
TRAPPER, CODE and HeNCE), GRADE exhibits significant advantages, which have
been discussed elsewhere.

Currently the GRADE environment supports PVM and MPI as target systems and
runs on UNIX hosts. The GRAPNEL compiler generates C source files from the
graphical representation of the program. The graphical symbols are language
independent, so the translator tool can be modified to generate source files
for other programming languages. The development team plans to support
Fortran, which is still very important in high performance computing. New
GRAPNEL API implementations will be developed as well, to support more message
passing systems, for example on the QNX operating system. Supporting QNX can
be important for industrial real-time applications.

Abstract. This paper describes HeSSE, a research project whose objective is
the development of a simulator of heterogeneous systems starting from an
existing simulator of PVM applications. After a discussion on the main issues
involved in dealing with a wider class of applications, the devised simulator
design is described. Finally, the state of the art of the project is presented.



1 Introduction 

Thanks to the availability of high speed networks and reliable run-time supports, 
almost the totality of the computing systems currently used for message-passing 
applications is heterogeneous. As far as the design and development of applications 
targeted at these systems is concerned, the main problem is the absence of any 
consolidated technique or tool. As a matter of fact, the approaches followed for 
development in the small, homogeneous environments mainly used in the past are 
totally inadequate to tame the complexity of contemporary computing environments. 
Moreover, the subtle effects of computer resource and network heterogeneity further 
complicate the traditionally difficult task of application performance evaluation and 
tuning [1]. Therefore, performance evaluation techniques that are more sophisticated 
and cost-effective than those currently available are needed. 

In the last few years, our research group has been active in the performance 
analysis and prediction field, developing PS [2], a simulator of distributed 
applications executed in heterogeneous systems using the PVM run-time system [3]. 
This experience led us to discover the high potential of simulation tools for 
application development [4], performance prediction and tuning. In particular, among 
the three customary approaches to performance analysis, namely monitoring, 
analytical models and simulation, only the third seems able to provide reliable 
performance predictions of complex systems. We think that the ability to obtain
reasonable information on application behavior and response time even at the
earliest stages of the development process (possibly in the absence of a
fully-developed program), and to compare different algorithms and workload
sharing policies on real, fictitious or unavailable target machines, is worth
the effort of learning to use a new and relatively unusual development
environment.

Dissatisfied with the user-friendliness and ease of use of PS, at the end of
1998 we decided to develop a new version of the simulator. PS was heavily
dependent on the Ptolemy graphical simulation environment [5], whose powerful
facilities were not exploited to any great extent. Its second fundamental
drawback was the lack of a module modeling the computing node scheduler, which
made it impossible to simulate applications with more than one task per
processor or to take easily into account the effect of the load due to external
processes. We also studied the possibility of widening the range of systems
that could be successfully simulated. The result was the decision to develop,
instead of a new version of PS, a modular, extensible simulator of the hardware
and software objects making up current heterogeneous distributed systems, with
support for the most commonly used programming environments (PVM, MPI,
socket-based).

However, the transition from a parallel program simulator such as PS to a
complete distributed heterogeneous system simulation environment was not just a
matter of developing new simulation objects modeling additional hardware or
software agents.
In simulators, speed is obtained at the expense of accuracy. The developer of a 
simulation environment has to choose which characteristics of the phenomena to be 
modeled are fundamental (in that they have a direct influence on system response), 
and which can instead be neglected without a significant loss of simulation accuracy. 
Of course, this choice is tied to the particular class of applications to be simulated. In 
other words, it is not possible to develop a general-purpose simulator, but rather one 
where an optimal trade-off between simulation speed and accuracy has been made for 
a particular (albeit wide) class of applications. 

PS was developed with scientific applications made up of coarse tasks running over 
relatively slow networks in mind. As a consequence, such details as the timing 
behavior of the underlying operating system, TCP/IP stack and I/O devices, the effect 
of message forwarding through daemons (typical of the PVM environment) were 
systematically ignored, being “hidden” behind long CPU and message transmission 
bursts. If the objective is the simulation of more finely grained applications, namely 
applications where computing and message transmission time do not systematically 
hide O.S., I/O and TCP/IP times, different choices have to be made. 

Below we discuss these and other issues linked to the transition from the
existing PS simulator to the new simulation environment, which has been named
HeSSE (Heterogeneous System Simulation Environment). The paper is structured as
follows. First we discuss the motivations and the design issues of the HeSSE
simulator. Then its structure is sketched, showing one of its most distinctive
characteristics, the dynamic environment configuration capability. Finally, the
state of the art of the project and the objectives of our future research are
presented.



2 Design Issues for a Heterogeneous System Simulator 

Upgrading the design of PS to an (almost) general-purpose simulator of
heterogeneous distributed systems is not just a case of developing objects
modeling new hardware and software components not available in the old
simulation environment. Below we consider some of the issues involved, whereas
the adopted solutions are discussed in the next Section, where the HeSSE
structure is described. We will not describe the structure of the old
simulator, which was presented in [2,4].

Probably the first question that must be answered is what to preserve of the old 
simulator design. As compared to other last-generation simulators [6-8], the 
distinctive features of PS are the use of input traces collected through a preliminary 
execution in a software development host, and the production of output in the form of 
simulated execution traces, which can be post-processed to get performance indexes 
or summaries of program behavior. Several years of use of the various PS prototypes 
have shown the substantial validity of this approach. A trace is essentially a sequence 
of snapshots of one particular program execution, the traced one. Therefore, traces can 
hardly ever be useful for debugging purposes, not to mention detailed analysis
of non-deterministic programs. They are instead fully satisfactory for
performance evaluation, as performance behavior does not depend to any great
extent on the particular program execution or path followed in the code. The
trace-based simulation cycle has proven to be simple, friendly and easily
understood by simulator users.

As for the tradeoff between accuracy and simulation speed, it should be noted that 
PS simulation is based on a very simplified view of program execution, which is 
modeled by predicting in a reasonable (but not particularly accurate) way the duration 
of CPU bursts, i.e., of the intervals of time spent computing between two successive 
interactions with the PVM run-time support. PS converts the CPU-time intervals 
extracted from the traces into the (predicted) duration of the corresponding CPU 
bursts in the target machine. The method used relies on a simple analytical model of 
the target computing system, which essentially takes into account the difference 
between the processing speeds of the host and the target for the given problem, 
evaluated beforehand by running suitable benchmarks [2]. In contrast, all
process interactions through the network, data exchanges included, are fully
simulated. This modeling structure, represented graphically in Fig. 1, has
turned out to be successful for the simulation of most PVM applications, making
it possible to obtain fairly accurate results (errors typically less than 5%)
with modest computational effort.
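The conversion of traced CPU time into predicted CPU-burst duration can be
illustrated with a short sketch. The exact analytical model of PS is given in
[2] and is not reproduced here; the sketch assumes the simplest reading of the
description, namely a linear scaling by the ratio of benchmarked processing
speeds, and the function and parameter names are hypothetical.

```python
def convert_burst(host_seconds, host_speed, target_speed):
    """Predict a CPU-burst duration on the target from its traced duration.

    host_speed and target_speed are benchmark results for the given problem,
    expressed in the same (arbitrary) unit. Linear scaling is an assumption
    of this sketch.
    """
    if host_speed <= 0 or target_speed <= 0:
        raise ValueError("speeds must be positive")
    return host_seconds * host_speed / target_speed


# A 2 s burst traced on a host that is twice as fast as the target.
predicted = convert_burst(2.0, host_speed=100.0, target_speed=50.0)
print(predicted)
```

Only the CPU bursts are converted in this way; network interactions are left to
the simulation proper, as stated in the text.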
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Fig. 1. Conversion from traced timing to simulation time in PS 

In light of this experience, there is no good reason to adopt different
solutions for the new simulator. The problem with PS is that it models
monoprogrammed nodes, where exactly one PVM task is executed per node with the
support of the PVM daemon and the operating system. Daemon and O.S. are not
actually simulated, as all the time spent in system activities (PVM daemon,
O.S. support, TCP/IP) is modeled as a whole. Only simulators targeted
exclusively at coarse-grained, computation-intensive processes can achieve
reasonable accuracy without a simulation model of operating system services,
I/O and interrupt processing. Looking at existing simulators, we find at one
end of the range of possible design choices simulators such as PS and Dimemas
[6], which ignore (or model very roughly) all CPU time not spent in application
processing. At the opposite end, there are simulators such as SimOS [9], which
instead perform a complete emulation of the entire operating system and any
attached I/O device. In the first case, reasonably high simulation speed can be
obtained, but accuracy can be satisfactory only for CPU-intensive applications.
In the latter, accuracy can be fairly high for any type of application,
including I/O-intensive ones, but simulation is very slow, even if a single
computing node is simulated. In practice, at the state of the art, the latter
solution is not applicable to the simulation of large complex systems. As will
be shown in the Section that follows, HeSSE is based on an intermediate
approach, as it adopts a simplified model of system node activities.



3 The HeSSE Simulator Structure 

In order to be relatively light-weight and highly portable, HeSSE does not rely
on a complete graphical simulation environment as PS does, but is based on a
fast simulation library written in C++. The simulator software has been
designed to make it easy to develop new components, or even to use existing
simulators as components modeling new objects and networks. To maximize the
modifiability, reusability and extensibility of the simulator design, the
software system has been developed using an object-oriented paradigm.

Fig. 2. HeSSE schematic structure

In each simulation session, a suitable simulator configuration has to be set
up, using as building blocks the components corresponding to the objects making
up the real hardware/software system. This task is carried out by the
Configurator, which, as shown by the H-like diagram chosen for the project logo
(Fig. 2), is the “heart” of the simulator. Our experience in the PS project,
where the configuration was set up “by hand” using the Ptolemy graphical
interface, helped us to understand the fundamental role played by the
configurator in the simulation of large, complex systems. In HeSSE, the
configurator reads all simulation input data from a Configuration file and a
Command file. The first describes all the components that are to be used and
their interconnections, whereas the latter contains the names of the
application trace files and the temporal parameters for the configurable
components (e.g., relative node processing speed, network bandwidth, ...). The
traces are to be produced beforehand by running the fully-developed program on
any available sequential or parallel hardware configuration (most typically, on
a single workstation), or built synthetically by means of a program skeleton
and the expected times spent in every section of sequential code [2]. The
Configurator instantiates all the required objects, passes them any required
temporal parameters, and starts the simulation run. It is important to note
that the target configuration can be dynamically altered, in that the
Configurator can set up all the components to be used and their
interconnections without recompiling (or even relinking) the simulator code.
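The idea of dynamic configuration can be sketched with a small component
registry. HeSSE itself is a C++ library and its file formats are not given in
the text, so the Configuration syntax ("component" and "connect" lines), the
dictionary standing in for the Command file, and all function and attribute
names below are hypothetical. The sketch only shows how components named in an
input file can be instantiated and wired together at run time, with no change
to the simulator code.

```python
REGISTRY = {}


def register(cls):
    """Make a component class available to the configurator by name."""
    REGISTRY[cls.__name__] = cls
    return cls


class Component:
    def __init__(self, name, **params):
        self.name = name
        self.params = params  # temporal parameters from the Command data
        self.links = []       # interconnections from the Configuration data


@register
class Node(Component):
    pass


@register
class Cable802_3(Component):
    pass


def configure(configuration, command):
    """Instantiate and connect components described in plain text."""
    objects = {}
    for line in configuration.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "component":
            kind, name = words[1], words[2]
            objects[name] = REGISTRY[kind](name, **command.get(name, {}))
        elif words[0] == "connect":
            first, second = objects[words[1]], objects[words[2]]
            first.links.append(second)
            second.links.append(first)
        else:
            raise ValueError(f"unknown directive: {line!r}")
    return objects


CONFIGURATION = """
component Node host_a
component Node host_b
component Cable802_3 lan
connect host_a lan
connect host_b lan
"""
COMMAND = {
    "host_a": {"relative_speed": 1.0},
    "host_b": {"relative_speed": 2.5},
    "lan": {"bandwidth_mbps": 10},
}

system = configure(CONFIGURATION, COMMAND)
print(sorted(system))
```

Changing the simulated system then means editing the two inputs only; a new
component type needs one registered class and no change to configure.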

A simulation session in HeSSE is represented graphically in Fig. 3. The trace
files of the events relative to each task are used to drive the simulation
engine. The duration of each CPU burst, extracted from the trace file, is
processed in order to derive the duration on the final target, thus taking into
account the effect of different machine speeds. As mentioned before, task
interactions with the O.S., run-time supports, I/O devices and networks are
instead dealt with by simulation. The events and the timestamps representing
the (simulated) execution of the program on the target environment are written
to an output trace file. This file can be filtered and converted into the
format required by virtually any program visualization tool. Subject to time
resolution constraints, the availability of a trace file of the simulated
execution allows any performance index to be evaluated with little effort.

Fig. 3. A HeSSE simulation session



To understand how the HeSSE simulator works, it is worth sketching its internal
structure, showing its main modules and how the simulation task is shared among
the various simulation components. In general, the simulation of a
heterogeneous system requires the simulation of processing nodes, network and
application software. In HeSSE, this is carried out by the Node, Network, and
Application components, respectively, as shown in Fig. 2. Below we examine each
of these tasks in turn.



3.1 Processing Node Simulation 

As far as the node activities modeling is concerned, HeSSE is based on an approach 
halfway between the one of PS (no O.S. and I/O simulation) and that of SimOS (full 
O.S. emulation). O.S. service times and I/O devices are not taken into account by 
complete emulation, but using a simplified model of the node activities. The temporal 
parameters used by this model (e.g., the times spent in O.S. calls) are to be measured 
beforehand on the particular combination of hardware and O.S. to be simulated. The 
objective is to obtain reasonable accuracy for the majority of heterogeneous system 
applications, while retaining a simulation structure of tractable complexity. 

The node activities simulated in HeSSE can be logically divided into three
groups: Process, Interrupt and Message-exchange Management. As regards the
first group, the Node is a sort of macro-component which includes all the
hardware/software components available in a processing node. It can be used by
the simulated application processes to ask the Operating System for services.
In fact, the Node is essentially an interface to three internal components
which simulate processor (CPU), pre-emptive scheduler (Scheduler) and operating
system kernel (Kernel) behavior, respectively. Among other things, the Node
provides the processes with a function that makes it possible to create a new
process, registering it with the Scheduler and CPU objects.

The Interrupt Management activities rely on a Driver object that is used as the
base class to implement the components that allow O.S. interaction with I/O
peripherals. Driver components are used as follows. An I/O device can register
with a Node component for a given interrupt. At registration (which is carried
out in the initialization phase by the Configurator), the Driver associated
with the I/O device is also connected to the Node. From that moment onwards,
the Driver is able to accept “interrupt” signals on a mailbox. Upon reception
of such a signal, the CPU object stops its activities, changes the processor
mode to supervisor and executes an ISR defined by the Driver. At ISR
completion, another signal wakes the CPU, which re-enters user mode and resumes
(simulated) processing.

The Message-exchange Management module is used to allow coordination and
communication among application processes. This is obtained through a fairly
simple message-exchange support relying on a non-blocking Send and a blocking
Receive primitive. These primitives are only used to coordinate the (simulated)
process execution, not to simulate real interprocess communications.



3.2 Network Simulation 

At the state of the art, the only networks that can be simulated by HeSSE are
single-bus Ethernet or Fast Ethernet. Components that will allow the simulation
of Routers, Myrinet and ATM networks are under development. As far as Ethernets
are concerned, a first component (Cable802_3) simulates the behavior of an
Ethernet cable at 10 or 100 Mb/s, making it possible to connect multiple
stations and to detect collisions. A second component (Ethernet) simulates the
behavior of the network interface card, implementing the CSMA/CD transmission
protocol and the reception of the packets sent to its own Ethernet address or
to all receivers. Every Ethernet is associated with a Driver, which acts as the
interface to the Kernel and moves data in transit between the network and the
destination process (in both directions).



3.3 Application Process Simulation 

Our objective in the development of the very initial version of HeSSE has been to 
mimic the functionality offered by PS (with higher simulation accuracy, of course). At 
application level, the only (distributed) process interaction that can currently be 
simulated is the one offered by the PVM run-time support. We hope to implement 
soon the components needed to simulate other supports (in particular, MPI) and the 
data exchange through sockets typical of network applications. By the way, it is worth 
pointing out that PVM is probably the most complex of all the three mentioned cases, 
since it is the only one where messages (unless explicitly directly-routed) are 
forwarded to their final destination through daemon processes. 

The simulation of PVM processes relies on three components: the PVM daemon
(PVM_Daemon), a manager of the physically-distributed data common to the whole
network of PVM daemons (PVM_Data), and a component representing PVM application
tasks (PVM_Task). PVM_Daemon is connected to the Node, to PVM_Data and to all
the PVM application tasks running in the same computing node. During the
initialization phase, the PVM_Daemon component asks the O.S. for the creation
of a daemon process, which sleeps until it is scheduled for execution following
the reception of a PVM message directed to that node.

The PVM_Task component models one application process. Each Node can be
connected to one or several instances of this component, which are scheduled
along with other possible user processes. During the simulation, all PVM_Tasks
loop reading records from the trace files of the corresponding processes. These
records may correspond to CPU bursts or to interactions with the node O.S. In
the first case, PVM_Task asks the Node for CPU time (more precisely, for a CPU
time equal to the duration of the CPU burst converted to the expected duration
on that node). In the latter, an O.S. service is requested and simulated by the
Node sub-components.
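The trace-driven loop of a task component can be sketched as follows. This is a
deliberately reduced illustration, not HeSSE code: it handles a single task on
an otherwise idle node (no Scheduler, no interrupts), it assumes that a traced
burst is converted by dividing by a relative node speed, and the record format,
class name and method names are all hypothetical.

```python
class SimpleNode:
    """Stand-in for the Node component: it only advances a local clock."""

    def __init__(self, relative_speed, service_times):
        self.relative_speed = relative_speed  # target speed / tracing-host speed
        self.service_times = service_times    # measured O.S. service times (s)
        self.clock = 0.0

    def request_cpu(self, seconds):
        self.clock += seconds

    def request_service(self, name):
        self.clock += self.service_times[name]


def run_task(trace, node):
    """Consume trace records: ("cpu", seconds) or ("os", service_name)."""
    for kind, value in trace:
        if kind == "cpu":
            # Convert the traced burst to its expected duration on this node.
            node.request_cpu(value / node.relative_speed)
        elif kind == "os":
            node.request_service(value)
        else:
            raise ValueError(f"unknown trace record: {kind!r}")
    return node.clock


trace = [("cpu", 2.0), ("os", "send"), ("cpu", 1.0)]
node = SimpleNode(relative_speed=2.0, service_times={"send": 0.5})
finish_time = run_task(trace, node)
print(finish_time)
```

In the real simulator several such tasks would compete for the CPU through the
Scheduler, so the finish time would also depend on the other processes of the
node.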



4 Conclusions 

This paper is essentially a preliminary report on the HeSSE simulation project. We 
have described the main reasons behind the decision to develop a new simulation 
environment, instead of upgrading an existing one targeted at coarse-grained PVM 
applications. An important point dealt with here is that a simulator design
should match the class of applications that can be successfully simulated,
making an optimal tradeoff between speed and simulation accuracy.

We have described some of the features of the old simulator we have decided to
preserve in the new design, and discussed the design constraints linked to the
necessity to simulate a wider class of systems and applications. Then we have
shown the structure devised for HeSSE, describing the functionality of the main
simulation components and discussing its dynamic configuration capabilities.

At the state of the art, an alpha version of the simulator targeted exclusively
at PVM applications has been implemented and is currently under testing. We are
trying to ascertain how wide the class of applications that can be simulated
with reasonable accuracy is. This process should give us the feedback required
to evaluate the validity of the simulator design, making it possible to
promptly revise any unsatisfactory choice.

Abstract. Distributed applications suffer from nondeterminism and thus may
behave differently in subsequent executions with the same input. To ensure
deterministic replay, the sequence of received messages should be recorded for
each process. The paper compares various strategies for tracing PVM programs.
It covers centralised and distributed approaches to tracing, as well as
techniques with and without race detection.



1 Introduction 

The rapid growth in the use of distributed applications has created a great
demand for mechanisms supporting this type of computing. A very important area
is recovery in a broad sense, covering not only dependability problems such as
fault tolerance, testing and debugging, but also visualisation and modification
of computations, as well as the application of control procedures (e.g. in
financial systems).

This article addresses the development of strategies for tracing distributed
applications in PVM. Finding a proper strategy for tracing application
behaviour is critical to reducing the recovery overhead, especially the storage
and time overhead. In previous work, an approach to tracing PVM applications
involving a race detection procedure was presented. This paper continues that
research. Other tracing strategies are compared to the one described in the
earlier article. The stress is put on time and storage efficiency. The term
“recovery of computations” is used in this paper as a generalisation of
execution replay. It covers two areas:

— backward state recovery techniques, 

— re-execution mechanisms. 

State recovery as well as re-execution depend on mechanisms for tracing
application execution in order to log the information needed by recovery
procedures. This paper concerns tracing techniques supporting re-execution.
Checkpointing techniques supporting roll-back procedures are well defined by
other authors and will not be discussed in this work.

Definition 1 (Recovery of computations)

Let Eij be an execution of a distributed application from its state S(ri) to S(rj), 
and S(Tk) be an intermediate state of this execution (ti < Tk < tj). Recovery of 
computations is defined as a transformation of S{Tj) into S'(rk) by rolling the 
application back to S'{Ti) and re-executing it. 

Rjk ■ R{Tj) ^ ^jk = {Rji + Eik) (1) 



Fig. 1. Recovery of computations $R_{jk}$ 



According to this definition, three phases of recovery of computations may 
be distinguished (see Fig. 1): 

— primary execution $E_{ij}$, 

— roll-back $R_{ji}$, 

— re-execution (replay) $E_{ik}$. 

Each of these phases involves its own recovery procedure. For the primary 
execution this is a tracing procedure, for the second phase a roll-back 
procedure, and for the re-execution a replay management procedure. Although 
the tracing procedure precedes the recovery itself, it plays a very important 
role where overhead is concerned. The overhead caused by the tracing procedure 
should be analysed very carefully, not only because of the additional cost of 
execution but also because of its influence on the behaviour of the replayed 
application, which is especially important in testing distributed software [?]. 

2 Problem of Nondeterminism 

The basic problem of replaying distributed applications is their nondeterministic 
behaviour. An application is said to be nondeterministic if its behaviour may be 
different for subsequent executions with the same input. 

Nondeterminism is caused by interactions with the environment. Its sources 
may include some system calls (e.g. random()), uninitialised variables, 
interrupts and signals [2]. Distributed applications also suffer from 
nondeterminism caused by message exchanges. While methods of dealing with the 
sources of nondeterminism that also affect sequential programs are well known 
[?], the problem specific to distributed applications is much harder to solve. 
This article addresses logging information during the primary execution in 
order to ensure deterministic replay. Only the problems specific to distributed 
applications are considered in this work. 

A typical method of achieving deterministic replay is to record the order in 
which messages are received by logging their control data during the primary 
execution. Logging the sender tid and the tag of each received message is 
sufficient in the PVM environment [7, ?]. For a nonblocking receive 
(pvm_nrecv(), pvm_trecv()) the data is recorded only if a message was actually 
received. During replay, the parameters of the receive functions (pvm_recv(), 
pvm_precv()) are substituted with the data recorded in the log [?]. A 
nonblocking receive (pvm_nrecv(), pvm_trecv()) is replaced by the blocking 
pvm_recv() if it successfully received a message in the primary execution; 
otherwise the nonblocking call is ignored. 
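The record-and-replay idea described above can be illustrated with a small, self-contained Python sketch. It does not use PVM; the `Mailbox` class and the functions `traced_recv` and `replayed_recv` are hypothetical stand-ins introduced only for illustration. The only PVM convention borrowed is that -1 acts as a wildcard for the sender tid and the message tag.

```python
ANY = -1  # wildcard value, mirroring PVM's use of -1 for "any sender" / "any tag"


class Mailbox:
    """Toy stand-in for the receive queue of one task (not a PVM API)."""

    def __init__(self):
        self.queue = []

    def deliver(self, tid, tag, payload):
        self.queue.append((tid, tag, payload))

    def recv(self, tid=ANY, tag=ANY):
        """Return the first queued message matching (tid, tag)."""
        for i, (m_tid, m_tag, payload) in enumerate(self.queue):
            if tid in (ANY, m_tid) and tag in (ANY, m_tag):
                del self.queue[i]
                return m_tid, m_tag, payload
        raise RuntimeError("no matching message queued")


def traced_recv(mailbox, log, tid=ANY, tag=ANY):
    """Primary execution: receive, then log the sender tid and tag."""
    m_tid, m_tag, payload = mailbox.recv(tid, tag)
    log.append((m_tid, m_tag))
    return payload


def replayed_recv(mailbox, log_iter):
    """Replay: substitute the logged tid and tag for the original arguments."""
    tid, tag = next(log_iter)
    return mailbox.recv(tid, tag)[2]


# Primary execution: two wildcard receives; tid 7 happens to arrive first.
primary, log = Mailbox(), []
primary.deliver(7, 1, "a")
primary.deliver(9, 1, "b")
primary_order = [traced_recv(primary, log), traced_recv(primary, log)]

# Replay: the messages arrive in the opposite order, but the log forces
# the receives to accept them in the order of the primary execution.
replay = Mailbox()
replay.deliver(9, 1, "b")
replay.deliver(7, 1, "a")
entries = iter(log)
replay_order = [replayed_recv(replay, entries), replayed_recv(replay, entries)]
```

In this sketch the replayed receives return the same sequence as the primary execution even though the arrival order differs, which is the property the logging is meant to guarantee.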

In the next section various tracing strategies are analysed with respect to the 
PVM characteristics. 

3 Tracing Strategies 

There are two general strategies for tracing the execution of a PVM program: 
centralised and distributed. The idea is either to create one centralised log 
file or a collection of files distributed over the virtual machine. The 
advantage of the centralised approach is the opportunity to perform on-line 
analysis of the recorded data. However, this strategy increases communication 
overhead because data must be sent to a centralised log manager. Distributed 
logging, on the other hand, introduces no communication overhead, but the 
recorded data needs to be merged before it can be analysed. The time overhead 
of both strategies is subject to analysis. For distributed logging it depends 
on the hardware (hard disk technology, type of interface, etc.) and on 
operating system mechanisms (disk caching). Considering the parameters of 
present systems and the relatively small size of each recorded portion of data, 
the expected time overhead should not be very significant. In Section 4 the 
results of an experimental comparison of the two strategies are presented. 

As stated in Section 2, to achieve determinism during replay we need to 
record the order in which messages are received. However, it is not necessary 
to trace all the messages. The storage and time overhead may be reduced by 
tracing only the receive functions that accept messages from any process, since 
only those receives may cause races. This approach is called optimistic. 
However, a log generated with the optimistic approach may still be redundant, 
because not every wild-card receive generates a race. To reduce the amount of 
recorded data to the racing messages only, a race detection procedure should 
be used. A tracing strategy based on race detection is called pessimistic. 
Netzer and Miller propose an elegant solution for detecting races “on the fly”, 
but their method, based on tracing the second message involved in a race, 
cannot be easily adapted to PVM. In this work an alternative method based on 
tracing the first racing message was used. This approach was described in [?]. 
By applying a race detection procedure we can limit the size of the log file; 
on the other hand, we increase the time overhead. In Section 4 experimental 
results are presented to compare the overhead introduced by different variants 
of the tracing procedure. 

As mentioned in [?], the race detection mechanism should be implemented with 
centralised tracing. This allows on-line identification of processes and proper 
initialisation of the vector clocks needed to perform the race detection 
procedure. 
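To make the role of vector clocks in race detection more concrete, the following Python sketch shows one common way of testing whether a later message races with an earlier wild-card receive. It is a generic illustration under stated assumptions, not the procedure of the paper: the function names are hypothetical, and the criterion used (a later message races with a receive if that receive did not happen before the send of the message) is the standard happened-before test.

```python
def vc_tick(clock, pid):
    """Return a copy of the clock with the component of process pid advanced."""
    clock = list(clock)
    clock[pid] += 1
    return clock


def vc_merge(local, received):
    """Component-wise maximum, applied when a timestamped message is received."""
    return [max(a, b) for a, b in zip(local, received)]


def happened_before(a, b):
    """True if the event stamped a causally precedes the event stamped b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def is_race(recv_clock, later_send_clock):
    """A later message could have been accepted by the earlier wild-card
    receive unless that receive happened before the message was sent."""
    return not happened_before(recv_clock, later_send_clock)


ZERO = [0, 0, 0]  # three processes: P0 receives, P1 and P2 send

# Scenario 1: P1 and P2 send independently; P0 accepts the message of P1 first.
send_1 = vc_tick(ZERO, 1)
send_2 = vc_tick(ZERO, 2)
recv_1 = vc_tick(vc_merge(ZERO, send_1), 0)
race_independent = is_race(recv_1, send_2)

# Scenario 2: after its receive, P0 sends to P2, and only then does P2 send.
send_0 = vc_tick(recv_1, 0)
recv_at_p2 = vc_tick(vc_merge(ZERO, send_0), 2)
send_2_late = vc_tick(recv_at_p2, 2)
race_ordered = is_race(recv_1, send_2_late)
```

In the first scenario the two sends are concurrent, so the order of reception must be logged; in the second the send of P2 causally follows the receive at P0, so no race exists and nothing needs to be recorded.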

4 Experimental Results 

Two series of experiments were performed to compare the time overhead 
introduced by the tracing procedure. The first compared optimistic tracing 
under the centralised and distributed strategies. The second concerned the 
centralised strategy with the optimistic and pessimistic approaches. Three 
small applications^1 were selected for the tests: 

INT - numerical integration, 

LIN - linear recurrence, 

HQS - quick sort in hypercube topology. 

All of these applications are implemented in the master-slave model. Each 
contains receive races, either in the communication between the master and the 
slave processes or among the slaves. The intensity of communication between 
slave processes differs from one application to another: INT involves no 
message exchange between slaves, for LIN the communication among them is 
moderate, and for HQS it is intensive. 

The tests were performed on a switched 10 Mbps CSMA/CD network of SUN 
Sparc 4 machines running Solaris 2.6. To reduce the influence of the 
environment on the results, every measurement was repeated 100 times and the 
average execution time was calculated. The time overhead $ovr_t$ is defined as 
the ratio of the difference between the average execution time of the 
application with tracing procedures, $t_p$, and that of the pure application, 
$t_a$, to the average execution time of the pure application $t_a$ (2). 

$$ovr_t = \frac{t_p - t_a}{t_a} = \frac{t_p}{t_a} - 1 \qquad (2)$$ 
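As a small worked example of the overhead measure in Equation (2), the Python sketch below averages repeated timings and computes the relative overhead. The function name and the sample timings are illustrative assumptions, not data from the experiments.

```python
from statistics import mean


def time_overhead(traced_times, plain_times):
    """Relative time overhead: t_p / t_a - 1, where t_p and t_a are the
    average execution times with and without the tracing procedures."""
    t_p = mean(traced_times)
    t_a = mean(plain_times)
    return t_p / t_a - 1.0


# Made-up timings in seconds, for illustration only.
traced = [10.5, 10.3, 10.7]
plain = [10.0, 10.1, 9.9]
overhead = time_overhead(traced, plain)  # about 0.05, i.e. roughly 5%
```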

Figure 2 presents the time overhead of the tracing procedures executed on a 
virtual machine consisting of four hosts. The points show measured results 
while the lines show approximated trends. The time overhead of tracing for all 
three applications under the distributed strategy is small (less than 5%) and 
may be assumed constant (the deviation is within the measuring error). The 
overhead caused by centralised tracing grows linearly with the number of 
processes in an application. The growth is faster for communication-intensive 
applications, which is explained by the larger number of messages sent over the 
network to the log manager process. There is a significant difference between 
optimistic and pessimistic tracing. It is caused by the race detection 
procedure, especially by timestamping the application messages. Detecting races 
“on the fly” requires implementing vector clocks [?] that are attached to every 
message sent. PVM applications use two kinds of communication functions: along 
with the basic pvm_recv() and pvm_send(), their p- versions may be used 
(pvm_precv() and pvm_psend()). The difference in their interfaces makes the 
development of a generic piggybacking mechanism impossible. That is why vector 
clocks are sent as separate messages, significantly increasing the time 
overhead. 

^1 The applications, originally developed by Roy Pargas and John N. Underwood, 
were retrieved from an Internet site and adapted by the author. 

Figure 3 presents the number of messages traced with the optimistic and 
pessimistic approaches compared to the number of all processed messages. The 
difference in the sizes of the log files produced by the optimistic and 
pessimistic tracing procedures strongly depends on the application and on its 
programming characteristics. Figure 3 shows that, for the three test programs, 
applying the optimistic tracing strategy significantly reduces the number of 
logged messages, while the pessimistic strategy does not bring a comparably 
large further reduction relative to the optimistic one. 

5 Conclusions 

The basic strategies for tracing application behaviour were compared both 
theoretically and experimentally. The results show that distributed tracing 
introduces a small overhead, while centralised tracing makes it possible to 
implement a race detection procedure that reduces the size of the recorded 
data. However, the use of a race detection procedure seems to be justified only 
for long-running applications, where the size of the recorded data may be a 
critical problem. 

Ongoing work will focus on the development of hybrid techniques based on 
distributed tracing with centralised logging of process creation information, 
which makes it possible to apply a race detection procedure. Mechanisms for 
attaching vector timestamps to messages will also be analysed in order to 
reduce the time overhead. 

Abstract. Two parallel programming models applied to an adaptive finite 
element code for solving nonlinear simulation problems on unstructured grids 
are compared. As a test case we used a compressible fluid flow simulation where 
sequences of finite element solutions form time approximations to the Euler 
equations. 

In the first model (explicit) domain decomposition of the unstructured grid 
is adopted, while the second (implicit) uses functional decomposition; both are 
applied to the preconditioned GMRES method that iteratively solves the finite 
element system of linear equations. Results for HP SPP1600, HP S2000 and SGI 
Origin2000 are reported. 



1 Introduction 

Two programming models can be adopted for parallel computing: message passing 
and data-parallel. Usually the first (explicit) offers higher parallel 
efficiency, while the second (implicit) offers ease of programming. They 
reflect the organization of the address space. Between the two extremes, i.e. 
the shared address space organization and the distributed memory architecture, 
there is a class of machines with virtually shared, physically distributed 
memory (often called Distributed Shared Memory, DSM, machines). This class 
includes several typical variants, such as cc-NUMA (cache coherent non-uniform 
memory access), COMA (cache only memory architecture) and RMS (reflective 
memory systems). cc-NUMA implementations are commercially the most popular. 
Examples come from HP (Exemplar with two interconnection layers), from SGI/Cray 
(Origin2000 with fat hypercube topology) and future SUN computing servers. A 
similar approach is incorporated in the present IBM RS/6000 SP computers with 
PowerPC 604e or Power3 SMP nodes. 

Since the advanced multiprocessors at present are constructed with SMP 
nodes the choice between the programming models is not obvious; integration of 
multiprocessing and multithreading would be profitable in the future. 

The standard finite element procedure for solving a given problem consists 
of creating element stiffness matrices and load vectors, assembling them into 
a global stiffness matrix and a global load vector, and then solving the 
resulting system of linear equations. The last task is often performed by a 
separate, general purpose library procedure. The parallelization of such a 
solver is done independently of the finite element mesh and the problem solved. 
Many results concerning computational mechanics have been published to date 
(see for example the general overview [?]). 

In the reported case the solvers are built into the algorithm. They use a 
particular data structure related to the finite element data structure and do 
not create a global stiffness matrix or a load vector. Instead, they proceed 
element by element and receive element stiffness matrices. The parallelization 
of such solvers is based on mesh partitioning and dedicated data handling. 

In this paper we present an extension of our previous studies [?]. We show 
timing results for a CFD problem obtained on HP SPP1600, HP S2000 and SGI 
Origin2000 computers. 

2 Algorithm for Flow Simulations 

We used the following variational formulation for the stabilized finite element 
method [?]: 

Find $U^{n+1} \in [\ldots]$ satisfying the suitable Dirichlet boundary 
conditions and such that for every test function $W$ the following holds: 

[...] (1) 

where: 

— $\Omega_C$ - the computational domain, $\Omega_C \subset \mathbb{R}^l$, $l = 2$ or $3$ 

— $n_i$ - the outward unit vector, normal to the boundary $\partial\Omega_C$ 

— $U$ - the vector of conservation variables $(\rho, \rho u_j, \rho e)^T$, $j = 1, .., l$ 
($\rho$, $u_j$ and $e$ are the density, the $j$-th component of velocity and the 
specific total energy) 

— $f^i$ - the Eulerian fluxes, $f^i = (\rho u_i, \rho u_i u_j + p\delta_{ij}, (\rho e + p) u_i)^T$, $i, j = 1, .., l$ 

— $p = (\gamma - 1)(\rho e - \frac{1}{2}\rho u_i u_i)$ - the pressure ($\gamma$ - the ratio of 
specific heats, $\gamma = 1.4$) 

— $\Delta t = [\ldots]$ - the time step length 

— [...] - nonlinear matrix functions representing stabilization terms and 
artificial viscosity 

The indices $i, j$ range from 1 to $l$, the outer superscripts of functions of 
$U$ refer to their actual argument (e.g. [...]), the summation convention is 
used and differentiation is denoted by [...]. 

The problem is discretized in space using triangular finite elements with linear 
shape functions. For time discretization we use a version of the Taylor-Galerkin 
time marching scheme [?]. A sequence of solutions to the one time step problem 
(1) constitutes a time discrete approximate solution to the Euler equations. 

At each time step a stabilized finite element problem is solved with the 
GMRES algorithm, which is one of the most successful and widely used iterative 
methods for nonsymmetric systems of linear equations [?]. It is preconditioned 
by block Jacobi iterations [?] and uses patches of elements in the finite 
element mesh that define blocks within the stiffness matrix. Only these blocks 
are assembled during the solution procedure, which avoids the problem of 
distributed storage of the global stiffness matrix. Matrix-vector products in 
the preconditioned GMRES algorithm are performed by means of loops over patches 
of elements and solutions of local block problems. 

In the restarted version, the number of Krylov space vectors, k, is limited 
to some small number (in our case k = 10) and the initial guess for each 
restart is taken as the current approximate solution. 
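For readers unfamiliar with the restarted scheme, the following pure-Python sketch shows a textbook restarted GMRES(k) iteration. It is a minimal illustration under simplifying assumptions and not the solver of the paper: it is unpreconditioned (no block Jacobi), it is sequential, and the matrix enters only through a user-supplied `matvec` callback, which is a hypothetical name chosen here to echo the matrix-free, element-by-element products described above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gmres_restarted(matvec, b, x0, k=10, tol=1e-10, max_restarts=50):
    """Restarted GMRES(k): at most k Krylov vectors are built per cycle, and
    each restart begins from the current approximate solution."""
    x = list(x0)
    for _ in range(max_restarts):
        r = [bi - ai for bi, ai in zip(b, matvec(x))]
        beta = math.sqrt(dot(r, r))
        if beta < tol:
            break
        V = [[ri / beta for ri in r]]  # orthonormal Krylov basis
        H = []                         # rotated Hessenberg columns
        cs, sn = [], []                # Givens rotation coefficients
        g = [beta]                     # rotated right-hand side
        for j in range(k):
            w = matvec(V[j])
            h = []
            for i in range(j + 1):     # modified Gram-Schmidt
                hij = dot(w, V[i])
                h.append(hij)
                w = [wi - hij * vi for wi, vi in zip(w, V[i])]
            h_next = math.sqrt(dot(w, w))
            h.append(h_next)
            for i in range(j):         # apply the earlier rotations
                t = cs[i] * h[i] + sn[i] * h[i + 1]
                h[i + 1] = -sn[i] * h[i] + cs[i] * h[i + 1]
                h[i] = t
            d = math.hypot(h[j], h[j + 1])
            c, s = h[j] / d, h[j + 1] / d
            cs.append(c)
            sn.append(s)
            h[j], h[j + 1] = d, 0.0
            g.append(-s * g[j])        # new residual norm estimate
            g[j] = c * g[j]
            H.append(h)
            if abs(g[j + 1]) < tol or h_next < 1e-14:
                break
            V.append([wi / h_next for wi in w])
        m = len(H)
        y = [0.0] * m                  # back substitution on the triangular part
        for i in reversed(range(m)):
            acc = g[i] - sum(H[c_][i] * y[c_] for c_ in range(i + 1, m))
            y[i] = acc / H[i][i]
        x = [xi + sum(y[c_] * V[c_][i] for c_ in range(m))
             for i, xi in enumerate(x)]
    return x


# Small nonsymmetric test system with exact solution (0.1, 0.6).
A = [[4.0, 1.0], [2.0, 3.0]]


def mv(v):
    return [dot(row, v) for row in A]


solution = gmres_restarted(mv, [1.0, 2.0], [0.0, 0.0])
```

With k = 10 the 2x2 example converges within a single cycle; choosing a smaller k forces restarts, trading memory for additional cycles.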



3 Message-Passing versus Shared- Address Space in 
GMRES Implementation 

In the message-passing programming paradigm, programmers treat their pro- 
grams as a collection of processes with private local variables and the ability 
to send and receive data between processes by passing messages. There are no 
shared variables among processes. Each process uses its local variables and oc- 
casionally sends or receives data from the others. 

We use this programming paradigm to create the explicit parallel algorithm, 
in which the whole computational mesh is divided into subdomains assigned to 
different processors. Each processor considers only the internal mesh nodes of 
its subdomain and exchanges information on boundary nodes with the other 
processors that deal with neighbouring submeshes (data locality is maintained). 
The subsequent steps of the GMRES algorithm are executed in such a way that, 
except for a few instructions related to global operations (e.g. calculation of 
vector norms and inner products), each processor performs calculations on local 
data. Local stiffness matrices for the block problems in the Jacobi algorithm 
are generated and assembled independently and in parallel for each subdomain. 
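The global operations mentioned above (norms and inner products) are typically the only points where the subdomains must combine results. The Python sketch below illustrates the pattern: each subdomain computes a partial inner product on its local data and the partial results are summed. It is a sequential simulation with hypothetical helper names; in the PVM code the final sum would correspond to messages exchanged with the master, and the sketch assumes that every node is owned by exactly one subdomain so that interface nodes are not counted twice.

```python
import math


def split(vector, parts):
    """Split a vector into contiguous chunks, one per subdomain."""
    base, extra = divmod(len(vector), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(vector[start:start + size])
        start += size
    return chunks


def local_dot(x_local, y_local):
    """Work done independently by each subdomain on its own data."""
    return sum(a * b for a, b in zip(x_local, y_local))


def global_dot(x_chunks, y_chunks):
    """Stand-in for the global reduction: sum the partial results."""
    return sum(local_dot(xc, yc) for xc, yc in zip(x_chunks, y_chunks))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0] * 6
x_chunks, y_chunks = split(x, 4), split(y, 4)
inner = global_dot(x_chunks, y_chunks)
norm_x = math.sqrt(global_dot(x_chunks, x_chunks))
```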

The practical realization of this model uses PVM for message passing between 
different processing units. There is one master process that controls the 
solution procedure and several slave processes that perform most of the 
calculations in parallel (master-slave model). 

To obtain minimal execution time for the problem, the partitioning process 
must optimize load balance and minimize interprocessor communication. Each 
subdomain should contain a number of degrees of freedom proportional to the 
computational power of its processor and a minimal number of interface nodes. 
In our study the domain decomposition is performed with a simple mesh partition 
algorithm based on the advancing front [?]. Several algorithms have been tested 
[?]; the smallest communication overhead is observed for the algorithm that 
ensures vertical alignment of the subdomain interfaces. This strategy results 
in the smallest number of interface nodes, minimizing the communication 
requirements. 

In the shared-address space programming paradigm (referred to as DSM 
below), the program is a collection of processes accessing shared variables. 
The shared-address space programming style is naturally suited to (virtually) 
shared-address space multiprocessors. 

Compilers can parallelize code automatically by verifying loop dependencies. 
Such an approach yields sufficient efficiency for simple loops only. For 
example, a subroutine call inside a loop prevents its parallelization. To 
overcome these problems, directives are introduced to increase the degree of 
parallelization and to control many aspects of execution manually. 

In the implicit programming model we optimize the program using compiler 
directives whenever a loop over blocks, nodes or individual degrees of freedom 
is encountered. In particular, parallelization is applied to the construction 
of blocks, the computation of element stiffness matrices and their assembly 
into block stiffness matrices, and the iterations of the block method. 

Shared-address space computers can also be programmed using the message- 
passing paradigm. The local memory of each processor becomes the logical local 
memory and a designated area of the global memory becomes the communication 
buffer for message passing. 



4 Results 

Our implementations have been tested on a flow problem known as the ramp 
benchmark of inviscid flow [?]. A Mach 10 shock traveling along a channel, 
perpendicular to its walls, meets at time t = 0 s a ramp forming an angle of 60 
degrees with the channel walls. A pattern with double Mach reflection develops 
and a jet of denser fluid along the ramp behind the shock is observed. 

Three parallel machines were used: HP SPP1600 (with 32 PA7200/120MHz 
processors organized in 4 SMP hypernodes; software: SPP-UX 4.2, Convex C 6.5 
and ConvexPVM 3.3.10), HP S2000 (with 16 PA8000/180MHz processors in one 
hypernode, using SPP-UX 5.2, HP C 2.1 and PVM 3.3.11) and SGI Origin2000 (with 
32 R10000/250MHz processors, IRIX 6.5, SGI C 7.3 and SGI PVM 3.1.1). During the 
experiments we ran the code on an SPP1600 subcomplex consisting of 16 
processors from four SMP nodes. On the S2000 and the Origin2000 (referred to as 
S2K and O2K respectively) no other users were allowed to use the machines. 

The results refer to the wall-clock execution time, T, for one time step (one 
finite element problem), chosen as representative of the whole simulation, 
with different meshes (16858, 18241 and 73589 nodes) and with the same number 
of GMRES iterations (equal to 5). Since fluctuations in T of several percent 
were observed, the measurements were collected three times from 5 subsequent 
time steps to obtain statistically more reliable results. 

By fixing the number of GMRES iterations for a given case, we separate the 
problem of GMRES convergence from the problem of the purely numerical 
efficiency of parallelization. 

In Fig. 1(a) the wall-clock execution time is presented for different numbers 
of processors, K, for different machines and for different programming models. 
The number of nodes, N, is rather moderate, N = 16858. The complicated 
architecture of the SPP1600 influenced the DSM model, while the results for the 
explicit model remained undisturbed, with no substantial influence of the lower 
bandwidth between the hypernodes. Better scalability is observed for the 
explicit model than for the DSM one, although the latter results were obtained 
with relatively small programming effort. In every case the PVM model turned 
out to be more efficient, with shorter execution times. For 30 Origin2000 
processors, saturation of the PVM characteristics becomes evident, while the 
DSM characteristics remain monotonic. 

Comparing the results for the different models, no clear explanation has been 
found for the performance difference between the explicit and implicit models 
on one computing node (i.e. for K = 1). It is probably due to the different 
node numbering in the meshes, resulting in different cache performance. 

In Fig. 1(b) timings for a larger mesh consisting of N = 73589 nodes are 
shown. Again, the difference between the PVM and DSM models is not large for a 
rather small number of processors (K < 16). A significant difference appears 
for K > 16, i.e. for O2K hypercube dimensionality d > 2. No saturation is 
observed for the PVM model, owing to the higher computation to communication 
ratio. For the DSM model an unexpected response is found, with a maximum at 
K = 24 and a monotonic decrease of execution time for larger K. This feature, 
which could result from the complicated node architecture, needs further 
investigation. 

Relative speedup values, S, are depicted in Fig. 1(c). Good scalability is 
obtained for the message-passing model, and it is better on the Origin2000 
machine. Except in the interval K > 16, the DSM model demonstrates higher 
speedup on the Origin2000 than on the S2000. In Fig. 1(d) we present the 
wall-clock execution time, T, normalized to the number of mesh nodes, N. Since 
the characteristics are very close to each other, this experimentally confirms 
the linear computational complexity O(N) of the algorithm. 
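The derived quantities discussed above can be computed from raw timings as in the following Python sketch. The definitions are the usual ones and are assumed here rather than taken from the paper: relative speedup is taken as S(K) = T(1)/T(K) with T(1) measured with the same parallel code, and the normalized time is T/N. All timings in the example are invented for illustration and are not measurements from the experiments.

```python
def relative_speedup(times):
    """times maps the number of processors K to wall-clock time T(K).
    Speedup is measured relative to the run with K = 1."""
    t1 = times[1]
    return {k: t1 / t for k, t in sorted(times.items())}


def parallel_efficiency(times):
    """Speedup divided by the number of processors."""
    return {k: s / k for k, s in relative_speedup(times).items()}


def normalized_time(times, n_nodes):
    """Wall-clock time per mesh node, T / N."""
    return {k: t / n_nodes for k, t in sorted(times.items())}


# Invented timings in seconds for two mesh sizes.
small_mesh = {1: 8.0, 2: 4.0, 4: 2.5}
large_mesh = {1: 32.0, 2: 16.0, 4: 10.0}

speedup = relative_speedup(small_mesh)
efficiency = parallel_efficiency(small_mesh)
per_node_small = normalized_time(small_mesh, 16000)
per_node_large = normalized_time(large_mesh, 64000)
```

In this made-up example the per-node times of the two meshes coincide for every K, which is the signature of linear complexity in N that the normalized plot is meant to reveal.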

5 Conclusions 

The implicit programming model can bring useful and scalable parallelization 
of CFD applications. For cc-NUMA machines it is profitable to use it for a 
rather small number of processors. Explicit programming gives better results in 
terms of scalability and execution time, at the price of a more complicated 
code structure. 

From the CFD study it follows that the implicit program is more sensitive to 
communication speed than the explicit one. Changing from execution on one SMP 
node to execution on multiple SMP nodes only slightly affects the performance 
of the explicit code, while it significantly influences the execution time of 
the implicit code. We therefore propose a heterogeneous model that uses 
implicit programming within an SMP node and explicit programming between the 
SMP nodes. This model would also be profitable for clusters of SMP 
workstations. 

Abstract. We use Neural Networks (NN) in order to design control 
architectures for autonomous mobile robots. With PVM, it is possible to spawn 
different parts of a NN on different workstations. Specific message passing 
functions using PVM are included into the NN architecture. A graphical 
interface helps the user spawn the NN architecture and monitors the messages 
exchanged between the different subparts of the NN. The message passing 
mechanism is efficient for real time applications. We show an example of image 
processing used for robot control. 



1 Introduction 

Our research group develops architectures for the control of autonomous mobile 
robots. We mainly take inspiration from neurobiology for designing NN 
architectures. This work is based on a constructivist approach. We first build 
the parts of the architecture that process the inputs and manage some low 
level behaviors (obstacle avoidance for instance). As in Brooks' subsumption 
architecture [?], the system consists of a hierarchy of sensory-motor loops: 
from very fast low level sensory-motor loops (obstacle avoidance, sensor 
reading ...) to very time-consuming image analysis and planning procedures. In 
our system, data issued by these mechanisms are processed by higher level 
modules which may in return influence them. This architecture is integrated in 
the perception-action framework (Per-Ac [?]). It is composed of a reflex 
pathway and a learned pathway. Once learning has taken place, it may override 
the reflex mechanism. The inputs to the system are the image taken from a CCD 
camera, infra-red sensors and the direction of north given by a compass. Each 
of these inputs may be processed separately. Moreover, some time consuming 
image processing algorithms may also be performed in parallel. Using a set of 
workstations, it is clearly not worthwhile to parallelize the network at the 
neuron level (one workstation per neuron!), but rather in terms of 
computational pathways (CP) being processed in parallel. Each CP corresponds to 
a functional part of the global NN architecture and may run on a different 
workstation. The exchange of information between CPs is most of the time 
asynchronous. We also want to minimize the number of messages exchanged. We use 
PVM [?] for spawning the CPs on the workstations and for the message passing 
libraries^1. 

^1 The current version used is PVM 3.4.3 for Solaris and Linux. 



J. Dongarra et al. (Eds.): EuroPVM/MPI2000, LNCS 1908, pp. 289-^23 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 



290 Mathias Quoy et al. 



Using PVM has two benefits. First, the architectures we have developed become 
bigger and bigger as the tasks being performed become more complex. So even 
though computational power is increasing, we cannot meet the real-time 
requirements of our robotic experiments. Second, brain computation follows 
different pathways in parallel and is also performed in different cortical 
areas. Thus, it is also interesting to preserve this parallelism in our 
simulations. 

PVM is used at two different levels. First, we have developed specific 
communication functions integrated in the global architecture, indicating the 
kind of information sent from one NN architecture to another (Section 2). The 
specific message passing algorithms described may be used in any real time 
application, in particular when there is a huge amount of data to deal with. 
Second, a process manager has been developed. This manager helps the user 
choose on which workstations to run the NN architectures, and monitors all 
message exchanges (Section 3). In Section 4, we study the performance of a 
message passing mechanism used in a NN architecture controlling a mobile robot. 
This architecture enables a robot to go where it has detected movement in its 
image. 

2 Designing NN Architectures 

We do not describe here how we construct the NN architecture for performing a 
particular task (indoor navigation for instance). This is the focus of numerous 
other papers [?]. We will rather stress how PVM fits into our NN architecture. 

The design of a NN architecture is performed using a specific program called 
Leto. It has a visual interface where the user may choose between several 
different kinds of neurons and connect them together. Not all groups represent 
neurons. Some may perform specific algorithmic functions (displaying an image 
on screen, giving orders to the robot ...). Once designed, a Leto architecture 
may be saved and is run using another program called Promethe (see next 
section). It is therefore possible to design different architectures dedicated 
to specific tasks. The problem is then to exchange data between these different 
tasks, and this is the level at which PVM is used. We thus deal here with the 
design of modules that may be integrated in the NN architecture. These modules 
specify to whom a message is sent, or from whom a message is received. 

We have coded specific message passing functions using the existing PVM 
routines. Our functions are implemented as algorithmic groups in the 
architecture. There is basically one group sending data in one NN architecture 
and another receiving data in another NN architecture. The data sent are the 
neuron values. There may be several different NN architectures running in 
parallel, and there may be several different sending and receiving groups in 
each NN architecture. So we need to define where to send and from whom to 
receive. This is implemented through the definition of symbolic links, which 
must be the same for the sender and the receiver. A symbolic link is coded in 
the name of the link arriving at a sending or receiving group. This name is 
composed of two parts: the symbolic name and a number corresponding to the 
message number used by PVM. After all tasks have been launched, a symbolic link 
table is built. This table matches each symbolic name with a PVM task 
identifier (tid). The relevant parts of this table are then sent to the 
different tasks. For two tasks to be able to communicate, the symbolic link 
name and the message number must be the same^2. Upon reception of a message, 
the activities of the neurons of the receiving group are the same as those of 
the sending group. Thus, from the receiving group's point of view, it is as if 
it were directly connected to the sending group through a “real” neuronal link. 
The message passing functions are functionally transparent. The sending and 
receiving functions are the following: 

— function_send_PVM: sends the values of the neurons it is connected to. The 
receiver is retrieved using the symbolic link name and the message number. 

— function_receive_PVM_block: waits for a message (neuron values) from a 
sender. The sender is identified by its symbolic link and message number. 

— function_receive_PVM_non_block: checks whether a message (neuron values) has 
arrived from the sender. If not, execution continues with the next neuron 
group. 
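The symbolic link mechanism and the non-blocking receive can be sketched in a few lines of Python. The sketch does not call PVM: the `SymbolicLinkTable` and `Network` classes and the two helper functions are hypothetical names introduced for illustration, and the table is simplified so that each (symbolic name, message number) pair maps directly to the tid of the receiving task.

```python
class SymbolicLinkTable:
    """Maps (symbolic name, message number) to a task identifier (tid)."""

    def __init__(self):
        self._tids = {}

    def register(self, name, msg_number, tid):
        self._tids[(name, msg_number)] = tid

    def resolve(self, name, msg_number):
        return self._tids[(name, msg_number)]


class Network:
    """Toy in-memory transport standing in for the PVM send/receive calls."""

    def __init__(self):
        self._queues = {}

    def send(self, tid, msg_number, values):
        self._queues.setdefault((tid, msg_number), []).append(list(values))

    def poll(self, tid, msg_number):
        queue = self._queues.get((tid, msg_number), [])
        return queue.pop(0) if queue else None


def send_group(net, table, name, msg_number, values):
    """Sending group: look up the receiver through the symbolic link."""
    net.send(table.resolve(name, msg_number), msg_number, values)


def receive_non_block(net, my_tid, msg_number, current_values):
    """Receiving group: copy the neuron values if a message has arrived,
    otherwise keep the current activities and carry on."""
    values = net.poll(my_tid, msg_number)
    return current_values if values is None else values


table, net = SymbolicLinkTable(), Network()
table.register("vision_to_motor", 12, 42)  # illustrative name, number and tid

activities = [0.0, 0.0, 0.0]
activities = receive_non_block(net, 42, 12, activities)  # nothing sent yet
before_send = list(activities)

send_group(net, table, "vision_to_motor", 12, [0.2, 0.9, 0.1])
activities = receive_non_block(net, 42, 12, activities)  # copies sender values
```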

We run our NN architectures in real time to control mobile robots. Depending
on the task to perform, it may be very important to have access to up-to-date
information. For instance, the optical flow must be computed on the most
recent image data. This is not mandatory for object recognition, since a
static object stays where it is. Nevertheless, it turns out that most
computations are asynchronous. Because we have to run in real time, we do not
want to wait for a task to finish before continuing the whole computation, and
we do not always know which task will run faster or slower. Asynchronous
processing may therefore raise two problems. First, information may be missing
in some parts of the architecture because some computation is not finished
yet. Conversely, some parts of the system may run much faster than others and
deliver a continuous flow of redundant information. Indeed, if the sending
task runs much faster than the receiving one, the receive queue will be
overwhelmed with messages, although the only important message to process is
the last one sent.

It is easy to solve the first problem. If information has not arrived yet, the
previous neuron values are used. We suppose there is a temporal coherence in
the neural information (internal dynamics of the neuron [?] and use of a
neural field [?]). To deal with the second problem, we have to introduce new
message passing functions:

— function_send_PVM_ack: sends a message only if the receiver has allowed it
by sending an ack message. If the ack message has not been received, it does
not send anything and execution continues on the next group of neurons.

— function_receive_PVM_block_ack: waits for a message (neuron values) from a
sender. Once the message has been received, this function sends back an ack
message to the sender.

— function_receive_PVM_non_block_ack: same as the previous one, but the wait
is non-blocking.

* The message number is not mandatory. It may be chosen randomly when the
symbolic link table is created, but it is easier to know it beforehand for
debugging purposes.

Thus a sender may now only send information if the receiver is able to
process it. In our implementation, the receiver sends its ack message just
after having received a message from the sender, so the sender is immediately
allowed to send again. If the sender runs faster than the receiver, it sends
its new message before the receiver has finished its computation. When the
receiver comes back to the receive function, a message is present in its
queue, but this message no longer reflects the current state of the sender.
An alternative version of the receiving function sends the ack before a
blocking receive. It is then the receiver which waits until the sender
catches the ack and sends its message. The message received now matches the
latest state of the sender, but the receiver has to wait for this message.
This function is function_receive_PVM_ack_block. It avoids saturating the
Ethernet link and losing too much computation time in communication
procedures.
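
The ack mechanism can be illustrated with a single-process simulation. This is a minimal sketch, not the Promethe implementation: the class `AckLink`, its methods and the assumption that the sender starts with one initial permission to send are all hypothetical, and no PVM call is involved.

```python
from collections import deque


class AckLink:
    """One-directional link with an ack permission, simulated in one process."""

    def __init__(self):
        self.data = deque()   # messages waiting in the receiver's queue
        self.credit = True    # assumption: the first send is allowed

    def send_ack(self, value):
        """Non-blocking send: transmit only if an ack has been received."""
        if not self.credit:
            return False      # nothing is sent, the caller moves on
        self.credit = False
        self.data.append(value)
        return True

    def receive_non_block_ack(self, previous):
        """Return the new value and ack it, or keep the previous value."""
        if not self.data:
            return previous
        value = self.data.popleft()
        self.credit = True    # the ack goes back right after reception
        return value


def run(sender_steps_per_receive, rounds):
    """Sender state advances several times for each receive of the receiver."""
    link, seen, state, value, max_queue = AckLink(), [], 0, None, 0
    for _ in range(rounds):
        for _ in range(sender_steps_per_receive):
            state += 1
            link.send_ack(state)
            max_queue = max(max_queue, len(link.data))
        value = link.receive_non_block_ack(value)
        seen.append(value)
    return seen, max_queue


seen, max_queue = run(sender_steps_per_receive=3, rounds=3)
```

With a sender three times faster than the receiver, the queue never holds more than one message, but the value received in each round is the one sent at the start of that round, not the sender's latest state. This is the staleness that sending the ack before a blocking receive is meant to avoid.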

3 Running NN Architectures 

A NN architecture is run using a specific program called Promethe. In each
NN architecture, a token moves from neuron group to neuron group, activating
them sequentially. When a PVM message passing group is activated, the
corresponding function described above is executed. Once all groups have been
activated, execution resumes at the first group (fig. 1).

We have seen that PVM is used as the message passing library between tasks
running NNs (Promethe processes). A task manager (Promethe_PVM_Daemon) is
also needed to spawn all these tasks. In particular, this process has to
build the symbolic link table and send it to the NN tasks (fig. 1). To
achieve this, the user has to define a file (name.pvm) indicating the task
name, the name of the workstation on which to run the NN, and the name of the
NN architecture. Then each symbolic link is given with its sending and
receiving task. After having built the symbolic link table, the program
displays a graphical interface using Xview. We have included in this
interface some helpful features such as: testing the network for the
available workstations, testing the speed of these workstations and ranking
them, assigning a task to a workstation, displaying the current state of the
tasks (running or waiting), and displaying in real time the messages
exchanged between the tasks. A task may be assigned to a workstation as
specified in the name.pvm file, on the fastest workstations only, or
randomly. All three options are available on line in the graphical interface.
A task may also be assigned to a specific workstation by selecting the
workstation’s name and then clicking on the task (shown as a square in the
display window). The state of the tasks and the messages exchanged are
monitored by the task manager. The purpose here is not to provide a complete
debugging tool, but rather a help for the user. This help gives two pieces of
information: whether a task is running or waiting for a message, and which
kind of message has been exchanged between tasks. The first gives hints about
the working load of each task and the efficiency of the message exchanges.
The second makes it possible to follow in real time the messages exchanged
between the tasks. Each time data is sent or received, a message is issued to
the manager, giving the sending and receiving tids and the message number
(fig. 1). So, once all tasks are launched, the manager waits for these
messages and displays the information on screen. The manager also catches all
PvmTaskExit signals so that it may resume its activity once all Promethe
tasks are over. Monitoring the message exchanges could also have been
implemented another way, with the manager waiting for any message exchanged
between any tasks and then sorting these messages by message number. This
supposes that all message numbers are different, which is not required if
the symbolic links are different.

Fig. 1. Sketch of the message passing between the task manager process
Promethe_PVM_Daemon and the launched Promethe tasks. Each Promethe task is
a NN architecture composed of groups of neurons. A token moves from one group
to the next, activating them sequentially.



4 Results 

We report here the performance results on a particular robotic experiment. Note
that the robot† is also a multi-processor system (a microcontroller for the
speed control and a 68340 microprocessor for the communications with the
workstations and the control of the different sensors). This system is not
under the control of PVM, but it also works in an asynchronous manner.

† Koala robot built by K-Team SA, Switzerland.

The task of the robot is to learn to imitate a given teacher [3]. At the
beginning, the robot can only use motion perception to go in the direction of
the moving object (supposed to be the teacher). If this behavior is not
associated with a negative reward, the robot also learns the static shape of
the moving object, so as to be able to move in its direction even if no motion
is perceived. This mechanism is also used to choose between different moving
objects [?]. In order to detect movements, the optical flow is computed. In
parallel, object recognition is performed by matching local subimages
centered on the points of locally maximal curvature. These feature points
correspond to the local maxima of the convolution of the gradient image with
a Difference Of Gaussians (DOG) function.

A sketch of the parallel architecture is given in figure 2. Task2 performs
the feature point extraction and shape recognition. Task1 performs data
acquisition, movement detection and robot movement control. These two tasks
are not fully synchronized, in the sense that task1 does not wait for the
result of task2 before continuing its computation (non-blocking receive).
However, task2 needs the information from task1 for its computation, because
it looks for the feature points only where movement is detected (when there
is a moving object; otherwise the whole image must be processed). The data
given by task1 to task2 is a 196x144 array of bytes (image) and a 35x25x3
array of floats corresponding to where movement has been detected. Task2
sends to task1 a 35x25x3 array of floats corresponding to the position of a
recognized teacher.
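
The array sizes above give a rough idea of the volume of each exchange. The following back-of-the-envelope sketch assumes 4-byte floats and ignores any PVM packing overhead; both are assumptions, as the text does not state them.

```python
FLOAT_BYTES = 4  # assumption: single-precision floats

image_bytes = 196 * 144                      # image, one byte per pixel
motion_bytes = 35 * 25 * 3 * FLOAT_BYTES     # where movement was detected
teacher_bytes = 35 * 25 * 3 * FLOAT_BYTES    # position of the recognized teacher

task1_to_task2 = image_bytes + motion_bytes  # 28224 + 10500 bytes
task2_to_task1 = teacher_bytes               # 10500 bytes
```

Under these assumptions each message from task1 to task2 carries about 38 kB, which is small compared with the capacity of a 100 Mb Ethernet link.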




Fig. 2. Sketch of the computation done by the two tasks. Note that because the receive
in task1 is non-blocking, the computation effectively executed in parallel is not always
what is marked as Processing1 and Processing2. Processing2 may be executed in task2
while task1 executes data acquisition and movement detection.



We have monitored the execution time in three different cases: the sequential
computation (without PVM) of the optical flow alone (which will become
task1), the sequential computation of the optical flow and the teacher
recognition, and the independent computation of the flow (task1) and the
teacher recognition (task2). These results are given in the following table.
The mean time in seconds is an average over 20 runs on SUN Sparc Ultra 10
workstations (part of a pool of workstations on a 100Mb Ethernet link). Each
run corresponds to a new experiment (in particular, new input data). Each
task runs on a separate workstation:



time in seconds                  seq. flow   seq. flow + recog.   PVM flow + recog.

seq. mean time                     3.06            9.38                  -
task1 mean time (0)                 -               -                   3.97
task2 mean time (1)                 -               -                   2.81
par. mean time (max)                -               -                   3.97
task1 mean time
  acquiring data (2)                -               -                   1.47
task1 mean time in PVM
  (sending and receiving) (3)       -               -                   0.03
processing 1 (0) - (2) - (3)        -               -                   2.47
task2 mean time in PVM (4)          -               -                   0.32
processing 2 (1) - (4)              -               -                   2.49



As expected, the parallel computation runs faster than the sequential one. On
average, task2 runs faster than task1, mainly because of the time task1
spends in communications between the robot and the workstation. As task2 has
to wait for data from task1 in a blocking receive, the time it spends in PVM
functions (in particular waiting for data) is longer. Communication between
the robot and the workstation is slow, so it would be particularly
interesting to have a process dedicated to this task. This information could
then be dispatched to other processes.
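
The derived rows of the table and the overall gain can be recomputed from the measured values. This is only a worked check of the reported figures; the speedup ratio is not stated in the text and is computed here for illustration.

```python
seq_flow_recog = 9.38          # sequential flow + recognition
task1, task2 = 3.97, 2.81      # rows (0) and (1)
acquire = 1.47                 # row (2)
pvm1, pvm2 = 0.03, 0.32        # rows (3) and (4)

parallel = max(task1, task2)                     # par. mean time (max)
processing1 = round(task1 - acquire - pvm1, 2)   # (0) - (2) - (3)
processing2 = round(task2 - pvm2, 2)             # (1) - (4)
speedup = seq_flow_recog / parallel              # about 2.36
```

The two processing times come out almost equal (2.47 s and 2.49 s), so the difference between the two tasks is explained by data acquisition on task1 and by time spent in PVM on task2.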



5 Discussion 

The message passing mechanism we have developed makes it possible to use PVM
for real-time applications. A message is sent only when the receiving task is
ready to process it, thus reducing the network load. Some tasks may need
specific data in order to perform their computation, while others may
continue working even if the newest information is not available yet.

We have used PVM for parallelizing independent sub-networks composed of
several groups of neurons. Some groups contain thousands of neurons. A further
speed-up of our system will come from parallelizing the computation inside
a group of neurons. We have not yet tested whether PVM or thread-based
(shared-memory) algorithms should be used. In the latter case, we would use
multi-processor architectures (dual- or quad-Pentium machines, for instance).
On a dual-processor architecture, a set of threads will update different
subgroups of neurons and will almost halve the computation time devoted to a
map of neurons.

This work is part of a project dedicated to the development of a neuronal
language. Our Leto interface already allows NNs to be designed quickly at a
global level, i.e. without specifying each connection by hand, for instance.
This construction is made on a graphical interface: the user does not have to
write down his specifications in a file. The difficulty is to provide the user
with enough (but not too many) different groups of neurons and links between
them to choose from when designing his NN architecture. Over time, we are
beginning to have standard neuron groups and links. Moreover, some parts of
the architecture are now stable and may be used without any changes for other
robotic experiments. This is the case for the image processing part, for
instance. A next step will be to provide “meta-groups” of neurons (the
equivalent of cortical areas) regrouping stable functional groups.
Abstract. In this paper, we present another study of an improved parallel
DSIR text retrieval system that can perform fast indexing of a text
collection of several gigabytes using a Pentium-class PC cluster. We use the
multiple-master/slave principle to implement this parallel indexing
algorithm. In each computing node, a master process is designed to work in
conjunction with a slave process in order to use as much as possible of the
computing power while one process or the other is waiting for I/O. We also
present special buffering and caching techniques for boosting the computing
performance. We have tested this algorithm on the large-scale TREC8
collection, and we present the experimental results and investigate both
computing performance and problem size scalability.



1 Introduction 

The increasing number of web pages on the Internet, as well as of electronic
textual documents in academic, business, legal and other domains, continues
to make it more difficult to access relevant information in a reasonable
time. A powerful search or retrieval tool such as an efficient information
retrieval system is thus required. In this paper we present another study of
an improved parallel DSIR retrieval system that can help end-users reach
relevant information by performing fast indexing of several gigabytes of
text. DSIR is a vector space based retrieval model in which the
distributional semantic information inherently stored in the document
collection is used to represent document content in order to achieve better
retrieval effectiveness [3]. The DSIR model consists of two parts, indexing
and retrieval. The indexing part prepares the text collection for searching,
and the retrieval part retrieves documents that match user needs.

Since the indexing method in DSIR is quite compute-intensive, a powerful
single-CPU computer is still not enough to manage several gigabytes of
textual data [?]. We thus mainly focus here on a new DSIR indexing technique
using the multiple-master/slave principle. In this model, the master process
is designed to work in conjunction with each slave process to use as much as
possible of the computing power on each machine while one process or the
other is waiting for I/O. This principle can be called the “co-operative
master-slave concept”, since each master and slave co-operate to solve
dependent problems. We implemented this parallel indexing algorithm using MPI
on a Beowulf cluster of Pentium-class PCs running the Linux operating system
[?]. For our experiments, we chose the TREC8 documents as our large test
collection [?]. The TREC8 collection consists of more than half a million
free-text documents of varying length. We indexed the FBIS collection (one
set of documents in TREC8), implemented additional caching and buffering
techniques, and studied the communication effectiveness of the processes in
our proposed algorithm by varying both cache and buffer size. We also indexed
the full TREC8 collection to study the scalability of our algorithm, and
found that it still scales quite well as the problem size increases.

We organize this paper in the following way. Section 2 briefly presents the 
DSIR indexing model. Section 3 discusses the proposed parallel DSIR indexing 
algorithm. Section 4 gives more detail about our experimental setup, results, 
and discussion. Finally, section 5 concludes this paper. 

2 DSIR Indexing Model 

DSIR is a vector space based retrieval model which adapts distributional
semantics to alleviate the effects of polysemy and synonymy found in
documents. In this model, the contexts of words are used to characterize the
meaning of documents [?]. In general, every word, as an elementary entity
that holds meaning, contributes its own semantics, according to its
occurrence and co-occurrence, to the whole content of the document in which
it appears. The co-occurrence statistic of a word in DSIR is defined as the
number of times that word co-occurs with one of its neighbors within a
pre-defined boundary, called the “distributional environment”. Possible
distributional environments are sentences, paragraphs, sections, whole
documents, or windows of k words. In the DSIR computational model (see
Figure 1), the distributional information extracted from a collection of
documents can be represented as a matrix, called the “co-occurrence matrix”.
Each row of the matrix holds the co-occurrence statistics between an index
term and its context terms, which are assigned to the columns of the matrix.
Using this co-occurrence matrix, the document vector can be derived by
applying Equation (1), where $m_{ij}$ records the co-occurrence frequency
between index term $x_i$ and context term $y_j$ extracted from the document
collection, and $W(f_{ni})$ is the weighting function [?] addressing the
importance of word $i$ in document $n$.

$$V(d_n) = \Big( \sum_{i=1}^{I} W(f_{ni})\, m_{ij} \Big)_{j} \qquad (1)$$
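
A small sketch may help to see how the co-occurrence matrix and the document vector fit together. It makes several assumptions that are not in the text: the distributional environment is taken to be a sentence, the weighting function defaults to the raw term frequency, and the toy vocabulary is invented.

```python
def cooccurrence_matrix(sentences, index_terms, context_terms):
    """m[i][j] counts how often index term i co-occurs with context term j
    in the same sentence (the distributional environment assumed here)."""
    row = {t: i for i, t in enumerate(index_terms)}
    col = {t: j for j, t in enumerate(context_terms)}
    m = [[0] * len(context_terms) for _ in index_terms]
    for words in sentences:
        for p, w in enumerate(words):
            if w not in row:
                continue
            for q, c in enumerate(words):
                if q != p and c in col:
                    m[row[w]][col[c]] += 1
    return m


def document_vector(doc_words, index_terms, m, weight=lambda f: f):
    """Component j is the sum over index terms i of W(f_ni) * m[i][j]."""
    vec = [0] * len(m[0])
    for i, term in enumerate(index_terms):
        f = doc_words.count(term)          # frequency of term i in document n
        if f:
            for j, m_ij in enumerate(m[i]):
                vec[j] += weight(f) * m_ij
    return vec


sentences = [["cluster", "runs", "linux"], ["node", "runs", "cluster"]]
index_terms = ["cluster", "node"]
context_terms = ["runs", "linux", "cluster"]
m = cooccurrence_matrix(sentences, index_terms, context_terms)
vec = document_vector(["node", "cluster", "cluster"], index_terms, m)
```

The document vector has one component per context term, so its dimension is the number of columns of the co-occurrence matrix, not the number of index terms.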

3 Parallel DSIR Indexing Algorithm 

The DSIR indexing phase takes a large amount of computing time, and the
memory requirement of the algorithm is too high to be met by a single
machine. To overcome this limitation, we propose a new approach employing
parallel and distributed computing using the multiple master/slave principle.
In this principle, each computing node runs two processes: a master process,
called the “Producer”, and a slave process, called the “Consumer”. The
producer is responsible for the computations of the algorithm, and the
consumer provides the shared data structure, i.e. the co-occurrence matrix.
This design alleviates the I/O blocking that arises in the traditional model
with a single master process and several slaves.

Fig. 1. Co-occurrence matrix and document representation in DSIR model.

During the co-occurrence matrix computation, a large co-occurrence matrix is
generated. To reduce the time needed to compute the co-occurrence statistics,
the whole textual document collection is split equally and distributed to the
computing nodes. The consumer on each machine is assigned to host a partition
of the co-occurrence matrix. Referring to Figure 1 above, computing node $i$
stores the co-occurrence vectors from rows $(i-1)I/NP$ to $iI/NP$, where $I$
is the number of index terms and $NP$ is the number of computing machines.
The co-occurrence data on each machine is computed by the producer, while the
update of the co-occurrence matrix partition is done by the consumer.
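
The row partitioning can be sketched as follows. The function names are hypothetical, and the choice of a half-open, 0-based row range is an assumption made to avoid overlap between neighbouring partitions; the text only gives the bounds $(i-1)I/NP$ and $iI/NP$.

```python
def partition_rows(node, num_terms, num_nodes):
    """Rows hosted by computing node `node` (1-based), as the half-open
    0-based range [(node-1)*I/NP, node*I/NP)."""
    start = (node - 1) * num_terms // num_nodes
    stop = node * num_terms // num_nodes
    return range(start, stop)


def host_node(row, num_terms, num_nodes):
    """Routing lookup: which node hosts the 0-based row `row`."""
    for node in range(1, num_nodes + 1):
        if row in partition_rows(node, num_terms, num_nodes):
            return node
    raise ValueError("row out of range")


first = partition_rows(1, 20000, 16)
last = partition_rows(16, 20000, 16)
```

With 20000 index terms on 16 nodes, every node hosts 1250 rows, and a lookup such as `host_node(1250, 20000, 16)` returns node 2.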

More precisely, the producer first extracts the co-occurrence statistics
from its local sub-collection, then identifies the destination of this data
and sends it to the consumer on the corresponding machine. The destination
machines are determined by consulting the co-occurrence routing table, which
maps co-occurrence vectors to their host nodes. The consumer collects the
incoming co-occurrence data and updates the co-occurrence vectors in its
partition. To avoid the problem of I/O blocking, this algorithm employs MPI
asynchronous communication.

After the co-occurrence matrix has been constructed, the document vectors can
be derived. During this phase, the producer reads a document from its local
sub-collection and converts it to a set of terms. Following Equation (1), and
since the co-occurrence matrix is distributed over the computing nodes, the
producer sends those terms to the corresponding consumers so that the
portions of the final vector can be computed. Finally, the producer gathers
the portions of the final vector from the consumers to produce the required
document vector.
3.1 System Tuning

In the co-occurrence matrix computation, a large number of global messages
are transmitted to update remote co-occurrence partitions. To reduce the
number of communications, the co-occurrence statistics of documents are
buffered until a threshold is reached. This threshold can be the boundary of
one document or of multiple documents. The buffered statistics are then
distributed to each consumer in one batch instead of element by element.
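
The buffering idea can be sketched as follows. This is an illustration under assumptions: the class `UpdateBuffer` and its methods are hypothetical, the threshold is counted in documents, and `send` is a stand-in for whatever transport is used (for example a non-blocking MPI send), not the authors' code.

```python
class UpdateBuffer:
    """Accumulate co-occurrence updates per destination node and send them
    in one batch once a threshold number of documents has been processed."""

    def __init__(self, send, threshold_docs):
        self.send = send                  # callable(node, updates): assumed transport
        self.threshold = threshold_docs
        self.pending = {}                 # node -> {(row, col): count}
        self.docs = 0

    def add(self, node, row, col, count=1):
        cell = self.pending.setdefault(node, {})
        cell[(row, col)] = cell.get((row, col), 0) + count

    def end_document(self):
        self.docs += 1
        if self.docs >= self.threshold:
            self.flush()

    def flush(self):
        for node, updates in self.pending.items():
            self.send(node, updates)
        self.pending, self.docs = {}, 0


sent = []
buf = UpdateBuffer(lambda node, updates: sent.append((node, dict(updates))), 2)
buf.add(1, 0, 0)
buf.add(1, 0, 0)      # merged with the previous update of the same cell
buf.add(2, 5, 1)
buf.end_document()    # below the threshold: nothing is sent yet
buf.add(1, 3, 2)
buf.end_document()    # threshold reached: one message per destination node
```

Besides reducing the number of messages, merging repeated updates of the same cell before sending also reduces the volume of data transmitted.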

To avoid a large number of remote co-occurrence matrix accesses, the
producer caches some parts of each co-occurrence matrix partition in main
memory. For the cache to be effective, only the parts of the co-occurrence
matrix with high document frequency are cached. This is done in the following
way. First, the index and context terms of the co-occurrence matrix are
sorted by document frequency in decreasing order. Second, the co-occurrence
vectors of the index terms are distributed to the computing nodes using a
simple round-robin method. Then the co-occurrence vectors of the index terms
with high document frequency, which are located in the top rows of each
partition, are cached.
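
The three steps above can be sketched as follows. The function names, the tie-breaking rule and the toy document frequencies are assumptions made for illustration.

```python
def distribute_round_robin(doc_freq, num_nodes):
    """Sort index terms by decreasing document frequency, then deal them
    out to the nodes one at a time. Returns one ordered list per node."""
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    nodes = [[] for _ in range(num_nodes)]
    for rank, term in enumerate(ranked):
        nodes[rank % num_nodes].append(term)
    return nodes


def cached_terms(nodes, rows_per_partition):
    """Top rows of every partition: the terms with the highest document
    frequency hosted by each node."""
    return {t for part in nodes for t in part[:rows_per_partition]}


doc_freq = {"the": 90, "text": 40, "index": 35,
            "matrix": 12, "cache": 7, "beowulf": 2}
nodes = distribute_round_robin(doc_freq, 2)
cache = cached_terms(nodes, 1)
```

Because the terms are dealt out in frequency order, the most frequent terms end up at the top of every partition, so caching the first rows of each partition captures the vectors that are accessed most often.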

4 Experimental Results 

In this section, we present the results of two sets of experiments. In the
first, we compare the computing performance before and after applying the
buffering technique to the co-occurrence matrix computation and the caching
technique to the indexing process. In the second, we examine the scalability
of our implementation using the full set of half a million TREC8 documents.
All experiments are performed on a cluster of 16 Intel Celeron 466MHz
processors, each equipped with 128MB of RAM and a simple 10GB IDE drive. All
computing nodes are connected via a 100Mbps Ethernet switch.



4.1 Effect of Using Buffering and Caching Techniques 

To study the effect of the buffering and caching techniques in our multiple
producer/consumer model, we use the FBIS collection (140317 documents) as our
test collection. We choose 20000 index terms and 2500 context terms, yielding
a co-occurrence matrix of 20000 by 2500 elements. The curves in Figure 2 and
Figure 3 summarize the results of these experiments.
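
A rough estimate shows why this matrix is distributed. The element size and the dense storage layout are assumptions, as the text states neither; the node count and RAM size are those of the cluster described above.

```python
INDEX_TERMS, CONTEXT_TERMS = 20000, 2500
BYTES_PER_ELEMENT = 4            # assumption: 32-bit counters, dense storage
NODES, RAM_PER_NODE_MB = 16, 128

elements = INDEX_TERMS * CONTEXT_TERMS
total_mb = elements * BYTES_PER_ELEMENT / 1e6
per_node_mb = total_mb / NODES
fits_on_one_node = total_mb <= RAM_PER_NODE_MB
```

Under these assumptions the full matrix needs about 200 MB, more than the 128 MB of one node, whereas a 1/16 partition needs about 12.5 MB.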

We discuss these results in two parts. For the co-occurrence matrix
computation, the speedup factor decreases when we run on two computing
nodes. Since the co-occurrence matrix is divided and scattered over several
consumers, a large amount of update data travels between producers and host
consumers over the network. Updating the co-occurrence matrix over the
network takes much more time than a single machine takes to access the whole
matrix in main memory. Thus most of the computation time is spent on data
travelling back and forth in the network. The speedup factor starts to
increase, while the efficiency curve decreases slightly, when we add more
computing nodes to the system. Moreover, when larger buffer and cache sizes
are used, we achieve a higher speedup factor. This result shows that both
the buffering and the caching technique help considerably to reduce the
network traffic between consumers.

Fig. 2. Co-occurrence matrix computational process with caching and buffering
technique.

Fig. 3. Document vectors derivation process with caching technique.

For the document vector derivation, the caching technique has only a slight
effect on the speedup factor, even when a larger cache is added to the
system. On closer examination of why performance does not improve with cache
size, we found that the producer must wait a long time to gather the large
messages containing the document vector portions computed by the
corresponding host consumers in order to derive the final document vector.
Sending and receiving large messages over the network is the main weakness
(perhaps of the message-passing library itself) that limits the efficiency
of the caching technique proposed here.
4.2 Scalability

To study the scalability of the proposed algorithm, we set up the
experiments in the following way. We took the whole TREC8 collection as our
test collection and reran the experiments with a 20MB cache and a buffer size
of 20. The results of these experiments are shown in Figure 4 and Figure 5.




[Figure 4: two plots; horizontal axis: Number of computing node(s)]

Fig. 4. Problem size scalability when using TREC8 as the test collection in
co-occurrence matrix computation process.




[Figure 5: two plots; horizontal axis: Number of computing node(s)]

Fig. 5. Problem size scalability when using TREC8 as the test collection in
document vectors derivation process.



The results show that our proposed algorithm still scales quite well when
the problem size is increased. This shows that the multiple master-slave
principle can help to solve a large compute-intensive problem such as
large-scale text retrieval.
5 Conclusion

This paper proposes another parallel DSIR text indexing algorithm using the
multiple master/slave model, and presents experimental results on the large
TREC8 collection. We illustrate that parallel and distributed computing
techniques such as PC clusters can be used to build an efficient tool for
indexing a large text collection. In particular, we have found that it is not
easy to achieve perfect speed-up with a fine-grain parallel algorithm for the
DSIR model, i.e. for the co-occurrence matrix computation and the document
vector derivation, because the DSIR model itself involves a large amount of
inherently global data. However, this problem can be alleviated by using
caching and buffering techniques. We also found that the co-operative
master/slave computing technique can be used to increase computing
performance, and we believe that it is suitable for solving large
compute-intensive problems such as information retrieval. We plan to improve
our parallel DSIR indexing algorithm and to test it with several million web
pages in the near future.
Abstract. This paper presents the implementation issues and results of a
technique named Switch Time Warp (STW) for improving optimistic Parallel
Discrete Event Simulation (PDES) in PVM environments. The STW mechanism is
used to limit the optimism of the Time Warp method. The idea of STW is to
adapt the execution speeds of the different logical processes (LPs)
dynamically in order to minimise the number of rollbacks and, as a
consequence, to reduce the simulation time. The proposed method achieves
significant time/performance improvements for rollback reduction in PDES.



1 Introduction 

The main objective of this paper is to describe the development and evaluation of
a tool designed to build distributed discrete event simulators based on the Switch
Time Warp algorithm [9][11]. This tool is intended to demonstrate the validity of the
algorithm proposed in [9] and, at the same time, to be a first step towards building
a flexible simulation environment that allows concrete problems in the field of
high performance simulation to be solved.

Simulation is one of the fields that requires the most processing time, and parallel
simulation therefore emerges as a useful tool for solving given problems in an
acceptable time.

Parallel simulation is also economically justified by the possibility of distributing
the work among machines connected in a network (cluster). It is possible to develop
high performance simulators on ordinary (modest) machines by using tools such as
PVM [8] to create a cluster-based parallel machine, e.g. a PC-Linux cluster.

Simulating the behaviour of complex systems promises to be one of the problems
that offers the greatest range of possibilities in the coming years. Problems such as
air traffic control, the behaviour of telephone interconnection networks,
meteorological prediction, the behaviour of populations, distribution of information,
etc. [5] can all require information obtained from parallel discrete event simulation.



* This work has been funded by the CICYT under contract TIC-98-0433.
J. Dongarra et al. (Eds.): EuroPVM/MPI 2000, LNCS 1908, pp. 304-312, 2000.

© Springer-Verlag Berlin Heidelberg 2000
2 Parallel Discrete Event Simulation (PDES)

A discrete event simulator can be defined as a model of a physical system that
changes state only at discrete time instants and that is controlled by events, i.e.,
the system is only observed at the instant at which an event occurs. Discrete event
simulation is attractive because obtaining an event model from a physical system
does not involve great complexity: it is only necessary to extract the actions and
the state changes that occur in the system.

2.1 Describing PDES Using an Example 

To clarify the above observation, we consider an example of a physical system whose
behaviour we want to predict. In this particular case we are interested in the
development of a population of wolves and goats obliged to live together. This
system belongs to the prey-predator category. The wolf is a natural predator of the
goats and therefore the goats are condemned to extinction. But the number of goats
is much higher than the number of wolves, and therefore they may not necessarily be
wiped out. Once the physical system is identified, it is necessary (analysis phase)
to obtain and define the events. In this case the conclusion is that there are
basically 6 events, each of which can be associated with an occurrence probability:



Event   Description                                          Probability

E1      A goat is born                                       0.1
E2      A wolf is born                                       0.08
E3      A goat dies (natural)                                0.096
E4      A wolf dies (natural)                                0.054
E5      A goat is eaten by a wolf                            0.2
E6      Wolves die through insufficiency of food (goats)     0.00001



With this discrete event model it is necessary to build a simulator that will
reproduce the behaviour of the physical system through these events. Moreover, the
following equations are needed to simulate the evolution of the populations:

1. $\Delta C$ = goats that are born - goats that die

2. $\Delta LL$ = wolves that are born - wolves that die

3. $C(t_n) = C(t_{n-1}) + \Delta C(t)$, where $C(t_n)$ is the number of goats at $t_n$

4. $LL(t_n) = LL(t_{n-1}) + \Delta LL(t)$, where $LL(t_n)$ is the number of wolves at $t_n$

5. Goats that die = goats that die of natural death + goats that are eaten

6. Wolves that die = wolves that die of natural death + wolves that die through
lack of food
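
The six equations can be applied directly once the number of occurrences of each event at a given instant is known. The sketch below assumes those counts are given; how they are drawn from the event probabilities is not specified in the text, and the function name `step` and the sample counts are hypothetical.

```python
def step(goats, wolves, events):
    """Apply equations 1-6 for one simulation instant.
    `events` holds the number of occurrences of E1..E6 at this instant."""
    goats_die = events["E3"] + events["E5"]       # eq. 5
    wolves_die = events["E4"] + events["E6"]      # eq. 6
    delta_c = events["E1"] - goats_die            # eq. 1
    delta_ll = events["E2"] - wolves_die          # eq. 2
    return goats + delta_c, wolves + delta_ll     # eqs. 3 and 4


counts = {"E1": 10, "E2": 4, "E3": 9, "E4": 3, "E5": 20, "E6": 0}
goats, wolves = step(1000, 50, counts)
```

Note that event E5 (a goat is eaten) only appears in the goat equations, but it is generated by the wolves; this is the kind of event that has to cross process boundaries once the model is split into a goat process and a wolf process.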

The variable t denotes the current simulation instant ($t_n$ is the virtual time at
instant n). The model proposed above is very simple, since it studies the interaction
of only two species. But if the goal is to study the evolution of a set of species, for
example a model with 100 species in which all of them are predators to some degree,
the problem becomes very complex. This situation would imply 10200 possible events to
try at each simulation instant (100 for births, 100 for natural deaths and 10000 for
depredation). The resulting time complexity makes the problem intractable within a
reasonable time, hence the need to accelerate the simulation process. Parallel
Discrete Event Simulation (PDES) addresses this problem because it allows us to
increase the computation power by using parallel/distributed systems. The idea is to
divide the problem (events) into a set of partitions and to assign each of them to a
processor of the parallel/distributed system. These partitions are linked by event
exchange (timestamped messages in the distributed system). From the discrete event
model described, a Parallel Discrete Event Simulator can be built. To do so, the
discrete event simulator must be divided into a set of tasks (logical processes, LPs)
so that each LP simulates a part of the model.
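
The event count quoted above follows from simple arithmetic: one birth event and one natural-death event per species, plus one depredation event per ordered pair of species. A minimal sketch, assuming every species can prey on every species (the function name is ours):

```python
def events_per_instant(n_species):
    """Candidate events per simulation instant when all species are predators."""
    births = n_species
    natural_deaths = n_species
    depredation = n_species * n_species   # one event per (predator, prey) pair
    return births + natural_deaths + depredation


print(events_per_instant(100))  # prints 10200
```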

In our example, we can build a logical process "goats" that processes the events
generated by the goats (birth, death or depredation) and a "wolves" process that
accomplishes the same task for the events generated by the wolves. A communication
interface between the two processes is also necessary to handle those events generated
by one process and consumed by the other (depredation). To profit from the
availability of distributed systems in terms of simulation speed, the processes must
be distributed over different processors.

Each LP processes external events (generated by other LPs) as well as internal
events (generated by the internal/local processing of the events). Each event can
change the local state of the system and/or generate one or more new events. The
global state of the system is defined in terms of the different local states of the LPs
that form the simulation. The PDES can be seen as a set of interrelated processes
where each one simulates a subspace of the space-time of the problem.

The main problem of PDES is that, whatever the distribution of the LPs, it must
guarantee that the simulated system is causal. The Local Causality Constraint (LCC)
states: a discrete event simulation that consists of a set of LPs interrelated
through timestamped messages fulfils the causality principle if and only if each LP
processes the events in non-decreasing timestamp order, i.e., the future cannot affect
the past [2]. The causality problem does not occur in a sequential simulator. In a
PDES simulator it is necessary to ensure that the processes act in a synchronised form.
Consider, for example, that the wolves' process (of the previous example) is slower
than the goats' process. The wolves' process (slower) will have advanced to $t_1$ in
its local time (local virtual time, LVT) and the goats' process (faster) will have
evolved to $t_2$ ($t_2 > t_1$). If at this time the wolves' process generates an
external event (E5, a goat is eaten by a wolf), this event has to be processed by the
goats' process at instant $t_1$ and not at $t_2$. The goats' process must go back to
time $t_1$ and cancel all the actions produced between $t_1$ and $t_2$. In this
situation the Local Causality Constraint (LCC) is violated.

As it is impossible to solve beforehand what the relative execution speed between 
the logical processes will be, a synchronisation mechanism is necessary. This syn- 
chronisation gives two different algorithm families in PDES: conservative and opti- 
mistic algorithms. 

The conservative algorithms are those that strictly observe the causality principle.
The amount of parallelism that can be extracted with conservative algorithms is
very limited, which is why the other type of algorithm (optimistic), which offers a
greater speed gain, is generally preferred.

Optimistic algorithms do not adhere strictly to the "LCC law" and can therefore
produce causality errors, but they have mechanisms to detect these errors and cancel
their effects. In an optimistic algorithm, a process $LP_i$ remains blocked only if it
has neither input messages nor internal events to process. Otherwise it chooses the
event with the smallest timestamp. If at some subsequent time an event with a smaller
timestamp arrives, a causality error occurs. At that moment the algorithm has to return
to a safe state and annihilate the effects produced from this state up to the last
event processed. Returning to a safe state forces the algorithm to keep the old states
in memory, which is one of its weak points. Time Warp (TW) is the most referenced
implementation of optimistic algorithms [5][2].

The TW algorithm is an optimistic mechanism based on the virtual time of each
process. Each process $LP_i$ has its own local virtual clock (LVT). A causality error
is detected when a process LP receives a message whose timestamp is smaller than its
LVT. In this situation the rollback mechanism is activated. This mechanism consists of
returning to a safe state and annihilating the effects of the events processed from
this state until the arrival of the event that produced the error. The principal
problems of TW that affect the speed of the simulation are: a) loss of simulation
work (rollbacks), b) annihilation messages (anti-messages) sent to remove the previous
incorrect messages, c) spread of the annihilation messages if the destination process
has already processed the wrong message (rollback chains).
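
The detection and rollback steps can be illustrated with a deliberately small Python sketch. It is not the implementation used by the authors: the class and method names are hypothetical, events are modelled as plain functions from state to state, states are assumed to be immutable values, and anti-messages (and therefore rollback chains) are left out.

```python
class TimeWarpLP:
    """Minimal optimistic logical process with state saving and rollback."""

    def __init__(self, initial_state):
        self.lvt = 0.0              # local virtual time
        self.state = initial_state
        self.history = []           # (timestamp, event, state before the event)
        self.rollbacks = 0

    def _process(self, timestamp, event):
        # Save the old state so that the event can be undone later.
        self.history.append((timestamp, event, self.state))
        self.state = event(self.state)
        self.lvt = timestamp

    def receive(self, timestamp, event):
        if timestamp >= self.lvt:
            self._process(timestamp, event)
            return
        # Straggler: its timestamp is smaller than the LVT (causality error).
        self.rollbacks += 1
        undone = []
        while self.history and self.history[-1][0] > timestamp:
            ts, ev, state_before = self.history.pop()
            self.state = state_before           # return to a saved state
            undone.append((ts, ev))
        self.lvt = self.history[-1][0] if self.history else 0.0
        self._process(timestamp, event)
        for ts, ev in reversed(undone):         # redo the cancelled work in order
            self._process(ts, ev)


lp = TimeWarpLP(10)
lp.receive(1.0, lambda goats: goats + 5)    # state becomes 15
lp.receive(3.0, lambda goats: goats * 2)    # state becomes 30
lp.receive(2.0, lambda goats: goats - 3)    # straggler: roll back, then replay
print(lp.state, lp.lvt, lp.rollbacks)       # prints 24 3.0 1
```

The final state equals the one a sequential simulator would reach by processing the three events in timestamp order, which is the property the rollback mechanism is meant to preserve.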



3 Switch Time Warp Algorithm 

In a TW mechanism the relative execution speeds of the different LPs cannot be
controlled, so some LPs may progress much further than others. In this situation, the
interaction between slow and fast processes generates rollbacks. Poorly balanced LP
speeds generate a higher number of rollbacks, and the processes lose a lot of time
processing them; the speed gain introduced by the algorithm can therefore be very
poor.

Several approaches have been developed to solve this problem: the Moving Time
Window [10], the Breathing Time Window [13], the Adaptive Time Warp
Concurrency Control Algorithm [1], the Probabilistic Cost Expectation Function
Protocol [3], or the PADOC Simulation Engine [4].

The idea of the proposed algorithm, "Switch Time Warp (STW)" [9][11], is to adapt
the execution speeds of the different LPs dynamically in order to minimise the number
of rollbacks and, as a consequence, reduce the simulation time. Our model is based on
the general case, where the number of LPs required for a simulation (n) does not match
the number of processors available to run the simulator (p), with n >> p. In this
case, the maximum simulation speed is achieved when the available processor time is
fully occupied with correct simulation work.

Our proposal includes a process manager (local scheduler) in each processor to lo- 
cally monitor the dynamic behaviour of the LPs allocated in it. The main task of this 
manager is to optimise the CPU-time occupation by balancing the relative execution 
time of those LPs involved in rollbacks. 







Remo Suppi, Fernando Cores, and Emilio Luque 



Once a certain threshold value for the number of rollbacks and anti-messages
(anti-messages are sent to undo the incorrect work carried out by a process that
executes a rollback) is detected in an LP, the process manager tries to adjust the
relative speed of the processes involved. The manager slows down the fastest LP
involved in this high generation of rollbacks (assigning it less CPU time so that its
LVT does not advance further) and spends that processing capability on accelerating
slower LPs. The threshold value is the point at which we decide that it is necessary
to limit the optimism of the TW. A detailed description of the STW algorithm can be
found in [9].
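
The decision taken by the process manager can be pictured with the following Python sketch. It is an assumption-laden illustration, not the algorithm of [9]: the record type, the priority scale and the rule "penalise the over-threshold LP with the largest LVT, favour the LP with the smallest LVT" are our own simplifications of the description above.

```python
from dataclasses import dataclass


@dataclass
class LPStats:
    name: str
    lvt: float          # local virtual time reached by the LP
    rollbacks: int      # rollbacks plus anti-messages seen in the last window
    priority: int = 0   # hypothetical scale: larger value = more CPU time


def rebalance(lps, threshold, step=1):
    """Slow down the fastest over-optimistic LP and speed up the slowest LP."""
    over_optimistic = [lp for lp in lps if lp.rollbacks > threshold]
    if not over_optimistic:
        return None                              # optimism within limits
    fastest = max(over_optimistic, key=lambda lp: lp.lvt)
    slowest = min(lps, key=lambda lp: lp.lvt)
    if slowest is fastest:
        return None                              # nothing to balance against
    fastest.priority -= step                     # less CPU time: LVT advances slower
    slowest.priority += step                     # the freed CPU time goes here
    return fastest.name, slowest.name


lps = [LPStats("goats", 120.0, 9), LPStats("wolves", 40.0, 1), LPStats("grass", 75.0, 2)]
print(rebalance(lps, threshold=5))               # prints ('goats', 'wolves')
```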

4 Implementation Issues 

The algorithm for STW is similar to the TW mechanism. The difference is that a new
set of function calls has been added in order to compute the values needed to evaluate
whether a process is running in an optimistic or over-optimistic state. These
functions are executed locally on the node, which adds a fixed computing overhead to
all the processes. This overhead is very low relative to the improvement in effective
utilisation rate obtained (at most 4.5% of the execution time of the TW).

The STW has been implemented for Unix systems (SunOS 5.x) using the PVM
libraries [8] and STLv3 (Standard Template Library). PVM (Parallel Virtual Machine)
facilitates work with a heterogeneous distributed machine. In our design of the STW
mechanism using the PVM library, we associate a Unix process with each LP in our PDES
scheme. This LP is a node in the distributed simulation application and communicates
with other LPs using the PVM library functions. Figure 1 shows the hierarchy of STW
processes, where the arrows indicate the order in which the processes are created. All
the processes except the "father" process are created by the PVM spawn primitive and
are controlled as a PVM group in each processor. The "father" process creates all the
STW schedulers from the user configuration. In each processor, the STW scheduler
creates the LPs that will execute the simulated application.

An important part of the implementation of the STW mechanism has been the
integration of the STW scheduler with the CPU process scheduler provided by the
operating system. In our case (SunOS 5.x) the OS supports three scheduling classes
for processes: Real-Time (RT), System (SYS) and Time-Sharing (TS). The LP processes
are executed in the TS class (and we assume that there are no processes in the RT
class). The STW scheduler (which operates at the highest priority of the TS class)
oversees the behaviour of all LP processes and changes the priority of any process that
runs in an over-optimistic state (having an optimism greater than the threshold
value). As the LPs receive a time slice proportional to their priority, the penalised
process receives less CPU time. With this new CPU time its "optimism" is reduced, and
the STW scheduler then decides whether or not to change the priority level of this
process again.

5 Case Study and Results

To validate and evaluate the performance improvement obtained by the proposed
STW method, we carried out a Ring Service Queuing (RSQ) experiment in two
different environments: simulation and real execution. In the simulation environment,
a PDES including the STW was simulated on a sequential simulator. In the real
execution environment, a PDES including the STW was implemented for Unix and PVM as
described in the previous section.

The RSQ application is described in terms of a set of logical processes (Fig. 2). An
RSQ graph has a set of nodes that receive input messages, generate internal events
and send output messages. This simple application is hard to simulate with optimistic
algorithms because of the large number of rollbacks and rollback chains that are
produced.




Fig. 2. Ring Service Queuing Model 



5.1 Simulation 

The simulation approach is suitable for running tests with a higher number of LPs
and processors. For this task, we used the Parallel & Distributed
Algorithm-Architecture Simulator (PandDAAS) developed by the UAB [13]. In PandDAAS
the distributed simulation is specified by an application graph (LPs) whose
distributed execution is simulated in the sequential simulator PandDAAS.

By simulation we analysed RSQ applications from 10 to 500 LPs (in the 500 LP
graph, 15*10^ events have been processed). Table 1 shows the results for both methods
in terms of the time used by the distributed simulation. In all cases the STW method
notably reduces the execution time of the simulation.



Number of     TW Simul. Time     STW Simul. Time     Simulation Time
Processes     (virtual time)     (virtual time)      Improvement (%)

    10              13,567             11,807               15
    50              96,657             59,373               53
   100             256,092            165,802               46
   500          26,455,566         13,843,024               91

Table 1. STW & TW Simulation Time & STW improvement (simulated)



5.2 Real Execution 

For the analysis in a real environment (Solaris OS & PVM), we used RSQ applications
with 2, 8, 16 and 20 logical processes (LPs) on a pool of Sun SPARCstations. The
average improvement of STW is 19.5% for 20 LPs.

Table 2 shows the execution times (1st and 2nd rows), the rollback reduction
(3rd row) and the execution time improvement (4th row) for the RSQ experiment with 2,
8, 16 and 20 LPs under the STW and TW methods. In the RSQ experiment with 2 and 8
processes, the TW execution time is better than that of the STW (1st and 2nd
columns), even though the quantity of rollback messages is reduced by 24.8% and 30.3%
respectively. This slowdown is due to the fixed overhead introduced by calculating
whether or not a process is in an over-optimistic state.

For 16 and 20 LPs the STW notably reduces the number of rollback messages (3rd and
4th columns). As a result of this reduction, the simulation under the STW has a better
execution time than under the TW. The execution time improvement increases with the
number of LPs.



                                   2 LP      8 LP     16 LP     20 LP

TW Execution Time (seconds)         333     1,554     3,626     8,461
STW Execution Time (seconds)        372     1,573     3,146     7,079
Rollback Reduction (%)             24.8      30.3      36.4      42.9
Execution Time Improvement
STW vs TW (%)                     -10.5      -1.2      15.3      19.5

Table 2. Real execution of Time Warp and Switch Time Warp
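
The last row of Table 2 can be recomputed from the first two. The text does not state the formula, but the published figures are consistent with the time saved being expressed relative to the STW execution time; the short Python check below rests on that inference.

```python
# Execution times in seconds, copied from Table 2 (keyed by number of LPs).
tw_seconds = {2: 333, 8: 1554, 16: 3626, 20: 8461}
stw_seconds = {2: 372, 8: 1573, 16: 3146, 20: 7079}


def improvement(tw_time, stw_time):
    """Time saved by STW, as a percentage of the STW execution time (assumed)."""
    return 100.0 * (tw_time - stw_time) / stw_time


for n_lps in tw_seconds:
    print(n_lps, round(improvement(tw_seconds[n_lps], stw_seconds[n_lps]), 1))
# prints -10.5, -1.2, 15.3 and 19.5 for 2, 8, 16 and 20 LPs
```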

6 Conclusions and Future Work

An improved method (the STW algorithm) for optimistic PDES and its implementation
using PVM have been presented and evaluated. To overcome rollback overhead in the
TW mechanism, a solution based on the idea of "limiting the optimism" by distributing
CPU time has been implemented. We have implemented a dynamic adaptation of the event
processing capability (LP speed), based on balancing the relative speed of the LPs
involved (STW mechanism). An RSQ application has been analysed using simulation and
real execution. By simulation, the potential improvement has been evaluated (91% with
respect to the TW for 500 LPs). The average improvement of STW is 19.5% in the RSQ
application for 20 logical processes executed under PVM on a pool of Sun
SPARCstations. The differences between the simulation results and the real results
can be attributed initially to the PVM communication model (all messages are
centralised by the PVM daemon, and this daemon is not present in the simulation runs).
A change of communication model is therefore necessary to obtain better results. The
possibilities are the use of the direct PVM communication method or the use of another
communication library with a different communication model (e.g. MPI [7]).

Future work follows two lines: a) Improvement of the real simulation environment:
we need to make extensive tests to analyse the influence of the communication model.
b) Mapping of the processes: the STW reaches its limit when all the processes on a
processor would have to be penalised because the slowest processes are assigned to
other processors. In this case, a load balancing technique is necessary.

The authors wish to thank Pere Munt i Duran for his contribution to the development
and implementation of the STW algorithm under PVM.

Abstract. Large-scale simulation of brain activity is based on a general theory
within the framework of statistical field theory. The theory and the algorithm
developed have now been implemented on a cluster. By extending the computational
capacity, the simulation of normal and pathological cortical activity propagation
became possible.



1. Neural Simulations 

1.1 Neurodynamic System Theory 

Dynamic system theory offers a conceptual and mathematical framework to analyze 
spatiotemporal neural phenomena occurring at different levels of organization, such as 
oscillatory and chaotic activity both in single neurons and in (often synchronized) 
neural networks, the self-organizing development and plasticity of ordered neural 
structures, and learning and memory phenomena associated with synaptic 
modification. 

There are two basic strategies to learn more about integrated neural mechanisms. The
inverse method starts with activity data and yields data on the functional
connectivities among the neural structures involved. Most inverse methods provide
information about the static relationship of structural connectivities. The direct
method, namely simulation based on physiologically realistic models of anatomical
structures supplemented with hypotheses on the structural connectivities among
substructures, provides simulated activity data to be compared with those derived from
experimental techniques. Simulators are the proper tools to test hypotheses on the
Functional Networks and the mechanism of activity generation and propagation
through different brain regions [1].

1.2 Neuro-Simulators: A Short Survey 

There are many neuro-simulators, see eg: 

Neural Modeling Software 

http://www.hirn.uni-duesseldorf.de/~rk/cneuroeu.htm 



J. Dongarra et al. (Eds.): EuroPVM/MPI 2000, LNCS 1908, pp. 313-321, 2000.
© Springer-Verlag Berlin Heidelberg 2000 

They can be categorized into two classes:

1.2.1 Conductance-based simulators are proper tools for simulating single cell and
small network dynamics. NEURON and GENESIS are the most extensively used
software packages. Detailed single cell modeling had a renaissance after
intracellular measuring methods were used for recording membrane potentials, and
there was a hope of recording even in different compartments of a single cell. We may
consider this technique microscopic; it cannot be extended to large-scale
simulations.

1.2.2 Neural Network based simulators are generally oriented towards artificial
neural networks; when they are used for brain simulation, they offer no way to
incorporate realistic data. http://www.geocities.com/CapeCanaveral/1624/



2. Neurodynamical (Population) Model: Short History and Present 
Status 

2.1 Brief History 

There is a long tradition of trying to connect the "microscopic" single cell behavior
to the global "macrostate" of the nervous system, analogously to the procedures
applied in statistical physics. Global brain dynamics is handled by using a continuous
(neural field) description instead of networks of discrete nerve cells.

Ventriglia constructed a neural kinetic theory [2]. Having been motivated by this 
approach, a substantially improved new theory, algorithm and software tool were 
established in the Budapest Computational Neuroscience Group [3]. 

Our goals were: 

• to give a general theory within the framework of statistical field theory and 

• a computational model to simulate large-scale neural population phenomena and 
to monitor also the behavior of the underlying "average single cell" 

• to prepare the simulating software 

• to adopt the model for simulating different cortical population phenomena. 

For the population behavior a diffusion model is defined which enables cells that are 
initially in the same state to be dispersed among different states. The model is 
equipped with some important features. Single Cell Model is integrated into the 
Population Equation. Continuum Model is Discretized and Scaled. 

3. The Equations 

The main partial differential equation describing the propagation of the population 
activity in the state space is the following : 
dgXr,u,X,t) ^ 
d t 

Ssi^ ’U,X ,t) D^ d^ g^r ,u,X ,t) _ 
2 du^ 2 dX^ 

b^r ,u,X ,t)-n^{r ,u,X ,t) 



where $g_s$ is a probability density function (PDF) describing the activity at a given
neural tissue point r and at time t. The other two variables are the random variables
of the PDF: u stands for the membrane potential and $\chi$ stands for the calcium
concentration of the cell. These two variables constitute a two-dimensional
probability density function for every point of the neural tissue and for every time
point.

The calcium influx into the cell is:

$\eta_s(u,\chi) = -\beta\chi - B\,I_{Ca}(u)$

and the electric current 






The $I_i$ currents are the currents generated by the membrane channels, and
$I_{s's}$ is the synaptic current between populations s and s', defined as

$I_{s's}(r,u,t) = -\gamma_{s's}(r,t)\,(u - E_{s's})$

where $\gamma$ is the post-synaptic conductance. This term includes the interaction
between the points of the neural tissue, so if the activity of a cell is high, it will
cause the membrane potential of other cells to rise, as follows:



/• /I infinite 

YAr,t)=j </),■(/•') Jo aAr' ,r))A^.Xk,.Xr' ,r),t')dt'dr' 



where we integrate on the whole space of the neural tissue and is the

activity of population s at position r at time t, $\phi$ is the density of the cells in
the population at given position, stands for the delay of the impact between two

positions, is the connection strength between the positions. If we excite a

synapse, the effect of this excitation can be measured for a while on the postsynaptic
cell, and the amount of this effect is a function of time and described by function







Szabolcs Payrits et al. 



g_s          PDF function of population s
r            location in real space
t            time
u            membrane potential
χ            calcium concentration
             electric current into the population s
η_s          calcium influx to the population s
b_s          ratio of the cells in population s returning from firing
n_s          ratio of the cells in population s which are going to fire
C, β, B      constants
I_i          input currents generated by the membrane channels
             synaptic current between population s and s'
γ            postsynaptic conductance function
φ_s'         cell density function of the population s'
             activity of the population s
             delay of the impact between two positions
k_s's        connection strength between two positions
             synaptic current function



4. Discretization, Implementation 

The discretization was done on a uniform lattice with respect to the position in
the neural tissue. We also use a uniform lattice for the probability density function
in the variables u and $\chi$.

The discretization of the equations describing the electric and calcium currents is
done simply by sampling the functions at the given discrete values of the variables.

The only notable problem arises when we solve the partial differential equation at
discrete points. We do so by exploiting the fact that $g_s(r,u,\chi,t)$ is a
probability density function, and our purpose is to preserve at least its first two
moments. This yields a constraint on the timestep and the lattice resolution, as these
quantities depend on each other.

The implementation is done in C++, having a class for every important object in the
simulation, so every part of the program can be easily changed if necessary [4].

• class AlphaFunction describes the functions representing the synapses

• class Cell describes the properties of one cell

• class ConnectionFunction defines the connection functions mentioned above

• class DensityFunction defines the cell density function φ above

• class Element implements the dynamics of one point in the neural tissue

• class Grid implements the whole neural lattice

• class Pipe calculates the effect of the excitations a cell gets; it corresponds to
the function γ



5. How to Parallelize This Problem?

The question is whether it is possible to change the architecture of the program
above to make it suitable for a distributed environment [1]. The most time-consuming
part of the program is the approximation of the partial differential equation
describing the probability density functions g, and it is apparent that the dynamics
of one point of the neural tissue is relatively independent of the other points: the
only influence of one point on another is through the function γ.

The simplest change is to split off the Element objects, keeping the interface on
the main node and doing the actual calculation on other machines. A further advantage
of having a main node is that the user interface is much easier to implement this way.

We selected the PVM environment because we work on different kinds of platforms,
even though every machine runs the same operating system.

We had one main process implementing all the classes except the actual calculation
in the Element class. In the constructor of the Element class we started a PVM child
process, setting the inputs and getting the outputs via PVM messages; in the
destructor, we destroyed the PVM process. This way the parallelization is totally
transparent: it is invisible to any object using the Element objects, so there was no
need to modify those parts of the program.



6. Performance and Optimizations

Given the implementation above, we achieved very poor performance. The utilization
of the total processing power of the machines was low, typically between 20 and 40
percent depending on the resolution of the neural lattice, while the usage of the
main node was 100%.

One of the main reasons was that moving only the Element objects into PVM children
was not sufficient, because with big lattices the number of connection functions (γ)
grows quadratically. The calculation of the connections requires some processing
power, which, because the delay of a synaptic excitation is several timesteps long,
slows down the main node for big lattices. Therefore we had to move not only the
calculation of the activity function but also the calculation of the connection
function to the distributed environment, in effect moving the Pipe classes to the PVM
children.

We assumed that this change would fully exploit the processing capabilities of the
available machines, but we obtained only 60-80% processor time usage, even though the
usage of the main node's processor time remained low, about 10-20 percent.

It turned out that, because of the asymmetrical architecture, the child processes ran
idle while the main node routed the excitation data and did miscellaneous
calculations and user interface processing. One way to avoid this would have been to
make the architecture symmetrical, with no main process and everything implemented in
PVM children.

As this would have made the implementation more complex, and we would have lost
the advantage of the transparency of the parallelization, we wanted to avoid dropping
the asymmetric architecture. We chose instead to implement an asynchronous
architecture. This means that the child processes do not calculate the probability
density function of the current timestep but the PDF at a later time point, so they do
not need to wait for the results of the calculations done on the main node. The
following figure illustrates the difference:

[Figure: time chart with one column each for the main node, child 1, child 2 and child 3]

Fig. 1. Time chart for synchronous and asynchronous architecture



We chose to calculate the following timestep in the child processes. The constraint
that every connection between the points of the neural field has to be at least one
timestep long makes this 'in advance' processing possible.

With these two optimizations, the processor time utilization climbed to nearly
100%, typically above 95%, independently of the discretization level.
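
Why a minimum connection delay of one timestep permits 'in advance' processing can be seen in a toy Python model. Everything in it is hypothetical (scalar activities, a linear update rule, the names `next_value`, `weights`, `delays`); the point is only that the value of a lattice point at time t+1 depends on values at times t or earlier, so the points can be updated independently and in any order.

```python
def next_value(history, i, weights, delays, decay=0.5):
    """Value of lattice point i at time t+1, using only states at times <= t.

    history[k][j] is the (toy, scalar) activity of point j at time k;
    weights[j][i] and delays[j][i] describe the connection from j to i.
    """
    t = len(history) - 1
    total = decay * history[t][i]
    for j in range(len(history[t])):
        if j == i:
            continue
        d = delays[j][i]
        if d < 1:
            raise ValueError("each connection must be at least one timestep long")
        source_time = t + 1 - d              # never later than t because d >= 1
        if source_time >= 0:
            total += weights[j][i] * history[source_time][j]
    return total


weights = [[0.0, 0.2, 0.1], [0.3, 0.0, 0.2], [0.1, 0.4, 0.0]]
delays = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # diagonal entries are never read
history = [[1.0, 0.0, 0.0]]

for _ in range(3):
    # Each point could be handled by a different child process, in any order.
    history.append([next_value(history, i, weights, delays) for i in range(3)])

print(history[-1])
```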

8. Results

8.1 Hardware 

We had the following hardware environment available for the simulations: 

8.1.1 Sun Ultra 30 

UltraSparc II 250 MHz processor
128 Mbyte RAM 
100 Mbit Ethernet adapter 
Linux 2.2 kernel 

8.1.2 Linux cluster 

16 diskless machines with 366 MHz Intel processors
128 MB RAM per machine 
Connected via 100 Mb Ethernet switch 
Linux 2.2 kernel 

The only node visible from outside is the main node; the other nodes can be
accessed through PVM [6], [7]

The results of our simulations on neural cortical structures [5], performed on the
cluster system described above, can be found on our web page at:
http://www.rmki.kfki.hu/biofiz/imate/duke/. The figures below show the increase
in performance when multiple nodes are used in the calculations. The time required
for calculation drops inversely with the number of nodes applied. When the speedup
ratio is plotted against the number of applied nodes, an almost linear performance
increase is observable.




[Figure: performance plotted against NR (number of nodes), from 1 to 10]




This boost in performance was achieved with a cost-efficient cluster machine, which
is comparable in performance with some multiprocessor architectures and is available
for a fraction of their price. The table below shows the performance advantage of the
cluster compared to the Sun Ultra 30.

[Figure: performance of the cluster compared with the Sun Ultra 30; axis label: number of gridpoints]



The computational efficiency could be reduced by slow communication between the
nodes. The figure below shows a low, 7-8% utilization of the communication capacity
at a reasonable discretization level.



9. Conclusions 

Cluster systems and parallelization based on message-passing protocols are
cost-effective solutions for neural simulations, as these simulations can typically
be split into well-separable parts.

Abstract. This paper presents an environment for distributed genetic
programming using MPI. Genetic programming is a stochastic evolutionary
learning methodology that can greatly benefit from parallel/distributed
implementations. We describe the distributed system, as well
as a user-friendly graphical interface to the tool. The usefulness of the
distributed setting is demonstrated by the results obtained to date on
several difficult problems, one of which is described in the text.



1 Introduction 

Genetic Programming (GP) is a new evolutionary computing approach aimed at
solving hard problems for which no specific algorithm works satisfactorily. GP
is a heuristic based on the principles of biological evolution, where individuals
are selected for survival and reproduction according to their adaptation to the
environment. GP considers the evolution of a population of computer programs
which can potentially solve a given problem. Specific operators are defined in
order to implement program mutation, crossover and selection. By defining a
fitness measure to be attached to each individual in the population and by biasing
the evolution towards fitter individuals, the iterative use of these operators drives
the process towards programs that solve the problem at hand better and better.
The GP approach was proposed by Koza at the end of the 1980s jSl and is now 
developing rapidly both in academia and industry. Individual programs in GP 
are expressed as parse trees using a restricted language that fits the problem to 
be solved. This language is formed by a user-defined function set F and terminal 
set T chosen such that it is thought to be useful a priori for the problem at hand. 

As an example, suppose that we are dealing with simple arithmetic expressions
in three variables. In this case suitable function and terminal sets might be
defined as: F = {+, -, /} and T = {A, B, C}. Some possible GP trees arising
from these sets are shown in Figure 1, where the genetic operation of crossover
(to be explained later) is also illustrated.

Evolution in GP is as follows. An initial random population of trees (pro- 
grams) is constructed. A fitness value is assigned to each program after actual 
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execution of the program (individual) and genetic operators are applied to se- 
lected individuals in order to produce new ones. The population size usually 
stays constant: new fitter individuals replace bad individuals. This cycle goes 
on until a satisfactory solution has been found or another termination criterion, 
such as maximum computing time, is reached. The aim in GP is to discover a 
program that satisfies a given number m of predefined input/output relations: 
these are called the fitness cases. For a given program $p_i$, its fitness $f_j(p_i)$ on the 
j-th fitness case represents the difference between the output $g_j$ produced by the 
program and the correct answer $G_j$ for that case. The total fitness $f(p_i)$ is the 
sum of the errors over all m fitness cases: $f(p_i) = \sum_{k=1}^{m} \| g_k - G_k \|$. A better 
program will thus have a lower fitness under this definition, and a perfect one 
will score 0 fitness. 
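
The fitness definition above can be illustrated with a minimal sketch. It assumes that a program is stored as a nested tuple `(function, left, right)` with terminal names as strings, that the function set is read as {+, -, /}, and that division is "protected" (returns 1.0 on a zero divisor); the names `evaluate`, `total_fitness` and `protected_div`, as well as the three fitness cases, are hypothetical and chosen only for illustration.

```python
import operator

def protected_div(a, b):
    # Protected division: return 1.0 when the divisor is zero. This is a common
    # GP convention assumed here; the text does not say how it is handled.
    return a / b if b != 0 else 1.0

FUNCTIONS = {"+": operator.add, "-": operator.sub, "/": protected_div}

def evaluate(tree, env):
    """Evaluate a parse tree: a terminal name or a tuple (function, left, right)."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))

def total_fitness(tree, fitness_cases):
    """Sum of absolute errors over all fitness cases (0 is a perfect program)."""
    return sum(abs(evaluate(tree, env) - target) for env, target in fitness_cases)

# Hypothetical target relation A + B - C, sampled on three fitness cases.
cases = [({"A": 1.0, "B": 2.0, "C": 3.0}, 0.0),
         ({"A": 4.0, "B": 1.0, "C": 2.0}, 3.0),
         ({"A": 0.0, "B": 5.0, "C": 1.0}, 4.0)]
perfect = ("-", ("+", "A", "B"), "C")   # reproduces every target exactly
worse = ("+", "A", "B")                 # ignores C, so it accumulates errors
```

Under this definition `perfect` scores 0 and `worse` scores a positive value, so a lower fitness indicates a better program.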

The crossover operation starts by selecting a random crossover point in each 
parent tree and then exchanging the sub-trees, giving rise to two offspring trees, 
as shown in Figure 1. Mutation is implemented by randomly removing a subtree 
at a selected point and replacing it with a randomly generated subtree. 
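
The two operators can be sketched as follows, under the same assumed tree encoding (a terminal is a string, an internal node is a tuple `(function, left, right)`). The helper names and the uniform choice of crossover and mutation points are assumptions made for illustration; actual GP systems often bias the choice of points.

```python
import random

def subtree_paths(tree, path=()):
    """List the paths (tuples of child indices) of every node in the tree."""
    paths = [path]
    if isinstance(tree, tuple):
        for i in (1, 2):
            paths += subtree_paths(tree[i], path + (i,))
    return paths

def get_subtree(tree, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of `tree` in which the node at `path` is replaced by `new`."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(children)

def crossover(parent1, parent2, rng):
    """Swap one randomly chosen subtree of each parent, giving two offspring."""
    p1 = rng.choice(subtree_paths(parent1))
    p2 = rng.choice(subtree_paths(parent2))
    child1 = replace_subtree(parent1, p1, get_subtree(parent2, p2))
    child2 = replace_subtree(parent2, p2, get_subtree(parent1, p1))
    return child1, child2

def mutate(tree, rng, random_subtree):
    """Replace a randomly chosen subtree by a freshly generated one."""
    point = rng.choice(subtree_paths(tree))
    return replace_subtree(tree, point, random_subtree(rng))
```

Because crossover only exchanges material, the total number of nodes in the two offspring equals that of the two parents.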




Fig. 1. Example of GP trees and of the crossover operation. 



Although artificial evolution by GP has been cast as a sequential process for 
descriptive purposes, such evolution is intrinsically parallel. Genetic program- 
ming evolution is a robust but slow process. Hence, parallel execution is a wel- 
come solution to reduce computing time. Parallel and distributed GP settings 
may also bring advantages from the algorithmic point of view. There are few 
studies in the field: early ones are P, where a today obsolete Transputer-based 
parallel computer is used and |0I . Initial work by our group based on PVM with 
more restricted features and without graphical monitoring tools is described in 
0. Here we present a new and richer implementation of our distributed GP 
system using MPI. The present environment features a graphical user interface 
whose aim is both to make its use more intuitive for novices and to allow expert 
users to closely monitor the evolutionary process. 

In Section 2 we describe general modeling issues in parallel and distributed 
genetic programming. Section 3 gives details on our implementation of distributed 
GP. Following that, we describe the graphical user interface and monitoring tool. 
Finally, a sample application to a difficult problem is discussed and we offer 
our conclusions. 

2 Distributed Genetic Programming 

Genetic programming can be readily parallelized by introducing multiple com- 
municating populations, in analogy with the natural evolution of spatially dis- 
tributed populations. This model is called island or coarse-grain. Subpopulations 
may exchange information from time to time by allowing some individuals to 
migrate from one subpopulation to another according to various patterns. The 
main reason for this approach is to periodically reinject diversity into otherwise 
converging subpopulations. As well, it is hoped that to some extent, different sub- 
populations will tend to explore different portions of the search space. Within 
each subpopulation a standard sequential genetic programming evolutionary al- 
gorithm is executed between migration phases. The most common replacement 
policy is for the migrating n individuals to displace the n worst individuals in the 
destination subpopulation. The subpopulation size, the frequency of exchange, 
the number of migrating individuals, and the migration topology are new 
parameters of the algorithm that have to be set in some empirical way. 








Fig. 2. Two commonly used distributed GP topologies: a) the “ring” topology, 
b) the “mesh” topology. Arrows represent message exchange patterns. 

A few migration topologies have been used: the “ring”, in which populations 
are topologically disposed along a circle and exchanges take place between 
neighbouring subpopulations, and the “grid”, where “meshes of islands”, possibly 
toroidally interconnected, communicate between nearest neighbours. These 
topologies are illustrated in Figure 2. One possible drawback of these static 
topologies is that some bias might be introduced by the constant exchange pattern. 
Dynamic topologies, where destination nodes change over time, seem 
more useful for preserving diversity in the subpopulations. We have used with 
success a “random” topology, where the target subpopulation is chosen randomly 
at each migration phase ([2] and see next section). 

The performance increase in terms of computing time can be modeled as follows. 
The sequential GP runtime of a population of P individuals is determined 
by the genetic operators and by the fitness evaluation. Selection, crossover 
and mutation take time $O(P)$ since all these operators act on each individual 
and perform a transformation independent of the program size. The fitness 
calculation is the most important part and its complexity $T_{fitness}$ is $O(PCm)$: each 
individual is evaluated m times and each evaluation takes on average C arithmetic 
or logical operations. Here C is the average program complexity (i.e. the 
number of nodes in the tree structure) and m is the number of fitness cases. 
The total sequential time is thus $T^{seq} = g\,T_{population} + g\,T_{fitness}$, where g is the 
number of generations. 

Consider the island model where N populations of P/N individuals each 
are equally distributed on a system of N machines. Now the genetic operators 
take time $O(P/N)$ while fitness evaluation is $O((P/N)Cm)$. The only overhead 
is given by the communication of migrating individuals, which takes time 
$O(k(P/N)C)$. Here $k \approx 0.05$ is an empirical constant which represents the fraction 
of migrating individuals; thus, communication time is small with respect to 
the other terms. 

Finally, if we consider an asynchronous island model where migration takes 
place with non-blocking primitives, the communication time is almost completely 
overlapped with computation and can be neglected to first approximation. Therefore, 
$T^{par}_{fitness} = (1/N)\,T_{fitness}$ and $T^{par}_{population} = (1/N)\,T_{population}$, giving nearly 
linear speedup. Of course, the preceding argument only holds for dedicated parallel 
machines and unloaded clusters and does not take into account process 
spawning, message latency time and distributed termination. Nevertheless, GP 
is an excellent candidate for parallelization as shown by the results presented 
here and in miEq. 
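
The cost model of this section can be collected into a short worked derivation. This is only a sketch of the argument given above: the symbols $T^{par}$, $T_{comm}$ (per-generation migration cost) and $S$ (speedup) are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
T^{seq} &= g\,\bigl(T_{population} + T_{fitness}\bigr), \\
T^{par} &\approx \frac{g}{N}\,\bigl(T_{population} + T_{fitness}\bigr) + g\,T_{comm},
  \qquad T_{comm} = O\!\bigl(k\,(P/N)\,C\bigr), \\
S &= \frac{T^{seq}}{T^{par}} \;\longrightarrow\; N
  \quad \text{as } T_{comm} \to 0 .
\end{align*}
```

Comparing the two dominant terms per island, $O(k(P/N)C)$ for migration against $O((P/N)Cm)$ for fitness evaluation, the communication-to-computation ratio is of order $k/m$, which is why the migration overhead matters less as the number of fitness cases grows.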



3 MPI Implementation 

The implementation of the tool described in this work can be divided into two 
components: a parallel genetic programming kernel implemented in C++ and 
with MPI message passing, and a graphical user interface written in Java. The 
parallel system was designed starting from the public domain GPC++ package 
PJ. Here we present the kernel and its parallelisation strategy, while the graphical 
monitoring tool is described in the next section. 

The computation can be basically thought of as a collection of processes, 
each process representing a population for the specific genetic programming 
problem. The processes/populations can be evolved in parallel and exchange 
information using the MPI primitives. The messages exchanged by these pro- 
cesses are groups of GP individuals and the communication happens through 
another process called the master that runs in parallel with the others and that 
implements a given communication topology. The master also sends termination 
signals to the other processes at the end of the evolution. In this configuration, 
each process/population executes the following steps: 






Francisco Fernandez et al. 



While termination condition not reached do in parallel for each population 

— Create a random population of programs; 

— Assign a fitness value to each individual; 

— Select the best n individuals (with n > 0) and send them to the master; 

— Receive a set of n new individuals from the master and replace the n worst 
individuals in the population; 

— Select a set of individuals for reproduction; 

— Recombine the new population with crossover; 

— Mutate individuals; 

And the master process executes the following steps: 

For each population do 

— Receive n individuals; 

— Send them to another population according to the chosen topology; 
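
The interplay of the two step lists above can be sketched in a single process, without MPI, to show the order of operations. This is not the authors' kernel: individuals are plain numbers with a toy fitness standing in for GP trees, `master_route` implements only the ring topology, and all names and default parameter values are assumptions made for illustration.

```python
import random

def fitness(x):
    # Toy stand-in for running a GP program on its fitness cases (lower is better).
    return abs(x - 3.0)

def evolve_one_generation(pop, rng):
    """Tournament selection, toy crossover (averaging) and Gaussian mutation."""
    new_pop = []
    for _ in range(len(pop)):
        a = min(rng.sample(pop, 2), key=fitness)
        b = min(rng.sample(pop, 2), key=fitness)
        child = 0.5 * (a + b)
        if rng.random() < 0.1:
            child += rng.gauss(0.0, 0.5)
        new_pop.append(child)
    return new_pop

def master_route(blocks):
    """Ring topology: the block sent by population i goes to population i + 1."""
    n_pops = len(blocks)
    return [blocks[(i - 1) % n_pops] for i in range(n_pops)]

def run_islands(n_pops=4, pop_size=20, n_migrants=2, generations=30, seed=0):
    rng = random.Random(seed)
    pops = [[rng.uniform(-10, 10) for _ in range(pop_size)] for _ in range(n_pops)]
    for _ in range(generations):
        # Each population sends its n best individuals to the master.
        outgoing = [sorted(p, key=fitness)[:n_migrants] for p in pops]
        incoming = master_route(outgoing)
        for i, p in enumerate(pops):
            # Immigrants replace the n worst individuals, then evolution proceeds.
            survivors = sorted(p, key=fitness)[:pop_size - n_migrants]
            pops[i] = evolve_one_generation(survivors + incoming[i], rng)
    return pops
```

In the real system each population would be a separate MPI process and the exchange with the master would be a message send and receive; only the routing function would change when another topology is selected.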

Before sending the individuals to the master, each population packs these trees 
into a message buffer. The master receives the buffer and directly sends it to an- 
other population. In this way, the data can be exchanged between processes with 
only one send and receive operation, and the packing and unpacking activities 
are performed by the population managing processes. The user can parameter- 
ize the execution by setting the value of n, the number N of individuals in each 
population and the communication topology, among others. The communication 
between the processes/populations and the process/master is synchronous in the 
sense that all the processes/populations wait until they have received all the n 
new individuals before going on with the next iteration. 

One important feature of the system is that it makes it easy to model several 
communication topologies, such as those depicted in Figure 2. The communication 
paths in these mesh and ring topologies are implemented by the master 
process. 

To implement the random topology, the master, each time it receives a block 
of individuals from a population, calculates a random number between 1 and 
the total number of processes and sends the block to the population whose MPI 
process ID corresponds to that number (the process ID of the master is 0). In 
order to promote fairness, a second constraint that is enforced by the system 
is that each population must receive a block of individuals before an exchange 
cycle is finished. 
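
One way to satisfy both constraints at once (a random destination, and every population receiving exactly one block per cycle) is to draw a random permutation of the population ranks. The sketch below shows this idea only; it is an assumption about how the fairness rule could be realized, the function name is hypothetical, and whether a population may be routed to itself is not stated in the text (it is allowed here).

```python
import random

def random_topology(pop_ranks, rng):
    """Map each sending population to a distinct, randomly chosen receiver."""
    receivers = list(pop_ranks)
    rng.shuffle(receivers)               # a random permutation of the ranks
    return dict(zip(pop_ranks, receivers))

rng = random.Random(42)
ranks = [1, 2, 3, 4]                     # rank 0 is reserved for the master
routing = random_topology(ranks, rng)    # sender rank -> receiver rank
```

Since `routing` is a permutation, no population is skipped and none receives two blocks within the same exchange cycle.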

4 Graphical User Interface and Monitoring Tools 

Most evolutionary computation environments do not feature a graphical user 
interface (GUI). This is inconvenient since parameter setting and other choices 
have to be done in old-style file-based fashion, which is obscure and difficult for 
the beginners to work with. Even the experienced researcher may benefit from 
a more user-friendly environment, especially if she wishes to closely monitor the 
complex evolutionary process, since this might shed light on the nature of the 
evolution itself. 

Fig. 3. The monitoring graphical user interface. 

Our GUI is written in standard Java and it was designed so as to be clean and 
easy to use. It communicates with the computation kernel through bi-directional 
channels and also starts the distributed computation. Information is displayed 
on a window featuring the actions that the user can follow, an example of which 
is given in Figure 3. The following is a succinct description of the actions and 
the information available to the user. The parameters for the run can be entered 
using the text fields for that purpose and if some parameter is not provided the 
system warns the user. Pre-defined standard default parameter settings are also 
proposed. Some less important parameters, which also have default values, can 
be set from a second window that appears by clicking the “options” button. Run- 
time quantities such as best and average fitness, average program complexity and 
size of the trees can also be calculated and displayed at any time during the run. 
The example in Figure 3 shows a graph of the average and best fitness for the 
population as a whole. The interface can also display the tree corresponding to 
the best current solution in raw or simplified form. The topology can be chosen 
from a list in the panel “connection topology” and an icon on the window shows 
the type selected (“ring” in the Figure). Facilities for end-of-run calculation 
of several useful statistics are also provided. We plan to add the possibility of 
examining statistics for each single node by clicking on the corresponding icon or 
by using a node number in a list. Color codes can also be useful for representing 
different states of the evolution process or to visualize nodes that are receiving 
or sending messages. 



5 Results and Conclusions 

The environment currently runs on a cluster of PCs under the Linux operating 
system, as well as on Sun workstation clusters. It has been tested on a number of 
problems, including difficult financial prediction applications [2]. In this section, 
we describe the results obtained on the Even Parity 5 problem. The boolean 
even-k-parity function of k boolean arguments returns True if an even number 
of its Boolean arguments are True; otherwise it returns NIL. 
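
The target function and its fitness cases are small enough to write out. The sketch below assumes that the fitness of a candidate program is the number of input combinations on which it disagrees with the even-parity function, which is one natural instance of the error sum defined in the introduction; the names `parity_fitness` and `always_true` are hypothetical.

```python
from itertools import product

def even_parity(bits):
    """True if an even number of the Boolean arguments are True."""
    return sum(bits) % 2 == 0

def parity_fitness(program, k=5):
    """Number of the 2**k fitness cases on which `program` is wrong (0 is perfect)."""
    cases = product([False, True], repeat=k)
    return sum(program(*case) != even_parity(case) for case in cases)

def always_true(*args):
    # A trivial candidate program, used only to exercise the fitness function.
    return True
```

For k = 5 there are 2**5 = 32 fitness cases, and a constant program such as `always_true` is wrong on exactly the half of them that have odd parity.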






Fig. 4. Convergence results for the 5 Even Parity Problem, with 3200 individuals 
subdivided into 1, 2, 4, 8 populations. 



Measurements have been performed on a population of 3200 individuals, 
on 2 populations of 1600 individuals each, on 4 populations of 800 individuals 
each and on 8 populations of 400 individuals each. The other key parameters 
of the runs were: random communication topology, maximum number of 
generations = 100, crossover rate = 0.95, mutation rate = 0.1, tournament selection. 
Figure 4 shows the average fitness as a function of the evolution time for 
the four implementations. Since GP is a stochastic algorithm, these curves rep- 
resent averages over twenty different runs. These results show clearly that the 
distributed setting gives faster convergence toward the optimal solution (fitness 
= 0). In particular, we note that the 4- and 8-populations cases show faster 
convergence than the single population or 2-populations cases. One might be 
led to believe that further partitioning of the populations is always beneficial. 
However, we have shown elsewhere that for many problems there is a convenient 
total number of individuals and of subpopulations that gives the most 
cost-effective results for a given predetermined computational cost. Thus, it is 
not useful to keep adding individuals or to distribute them into ever smaller 
populations beyond a certain limit. 

Although there are also time savings due to parallel execution (see Section 2), 
we did not measure speedup data since the cluster of workstations employed for 
the experiments was always in use for many other processes as well. 

Currently, the system is being used on real-life machine learning applica- 
tions, especially in the field of finance. As well, we use the system to perform 
experimental studies of distributed GP on several classical benchmark problems. 
This is a part of a longer term project whose aim is a better understanding of 
the dynamics of multi-population GP by way of experiment and by theoretical 
modeling. In the future, we plan to extend the capabilities of the system towards 
totally asynchronous execution and to a geographically enlarged metacomputing 
framework. 

Abstract. Pricing options in the financial industry often requires the use of 
Monte Carlo methods. We describe and analyze the performance of a 
cluster of personal computers dedicated to Monte Carlo simulation for 
the evaluation of financial derivatives. Monte Carlo simulation (MCS) 
usually requires a large amount of computer time. This requirement restricts 
most MCS techniques to supercomputers, available only at supercomputer 
centers. With the rapid development and low cost of PCs, PC clusters are 
evaluated as a viable low-cost option for scientific computing. The free 
implementation of PVM is used on Fast Ethernet based systems. Serial 
and parallel simulations are performed. 



1 Introduction 

Among the different numerical procedures for valuing options, the Monte Carlo 
simulation is well suited to the construction of powerful pricing models. It 
is especially useful for single variable European options where, as a result of a 
non-standard pay-out, a closed-form pricing formula either does not exist or is 
difficult to derive. In addition, the price of complex options is sometimes difficult 
to explain intuitively and a simulation can often provide some insight into the 
factors that determine the pricing. 

The commonly used Monte Carlo simulation procedure for option pricing can 
be briefly described as follows: firstly simulate sample paths for the underlying 
asset price; secondly compute its corresponding option payoff for each sample 
path; and finally, average the simulated payoffs and discount the average to yield 
the Monte Carlo price of an option. 

An option is a contract that gives you the right to buy or sell an asset for 
a specified time at a specified price. This asset can be a “real” asset such as 
real estate, agricultural products, or natural resources, or it can be a “financial” 
asset such as stock, bond, stock index, foreign currency, or futures contract. 
Essentially, by buying the option, you transfer your risk to the entrepreneur 
selling you the options. 

Therefore, an option is a contract between two parties: a buyer and a seller 
(or option writer). The buyer pays the seller a price called the premium, and 
in exchange, the option writer gives the buyer the right to buy or sell some 
underlying securities at some specified price for some specified period of time. 

An option to buy is a call option, and an option to sell is a put option. The 
specified price is called the strike or exercise price, and the option’s life is called 
the time to maturity or time to expiration. If the right to exercise is “any time 
until maturity” then the option is called an American option. If the right to 
exercise is “at time of maturity” then the option is called a European option. 



2 The Black-Scholes Model 

We denote by $S(t)$ the stock price at time t. In the certainty case, the stock 
price at the time of the option’s maturity T equals the future value of the 
stock price $S(0)$ when continuously compounded at the risk-free interest rate r, 
$S(T) = S(0)e^{rT}$. One way to think about this is that the future value is the end 
result of a dynamic process. That is, the stock price starts at $S(0)$ at the present 
time 0, and evolves through time to its future value. 

The formal expression that describes how the stock price moves through 
time in the certainty case is: $dS/dt = rS$ (this says that the rate of change of the 
stock price over time is proportional to the stock price at time t). The expression 
above describes the dynamic stock price process in a world with certainty. We 
can rewrite the equation as: $dS/S = r\,dt$. In this form, r is the instantaneous 
rate of return of the stock (r is also called the drift rate of the stock price process). 

If instantaneous return is r, its logarithm is a continuously compounded 
return. For the case of no uncertainty, the drift rate for the logarithm of this 
process is the same, but in an uncertain world this is not the case. 

In a world with uncertainty and risk-averse investors, we expect that the 
instantaneous return from the stock, denoted $\mu$, will exceed the instantaneous 
risk-free rate of return (i.e. $\mu > r$). We must add a source of randomness to 
the instantaneous rate of return which has statistical properties that capture 
the fact that observed stock prices vary, and that a typical stock price path has 
variance which increases with time. 

Black and Scholes [2] assume a model for stock price dynamics that is formally 
described as geometric Brownian motion. This model has the following form: 

$$\frac{dS}{S} = \mu\,dt + \sigma\,dW , \qquad t \in [0,T] \tag{1}$$

where the parameters $\mu$ and $\sigma$ are constant with respect to t and S. Here there 
are two factors that affect the instantaneous rate of return on a stock. The first 
one is the time. Over the period of time dt, the stock’s return changes by the 
amount $\mu\,dt$. The second factor is uncertainty. The sensitivity to this source of 
uncertainty is captured by the term $\sigma$, which is the volatility coefficient for the 
stock price. The net effect of adding the term $\sigma\,dW$ to the certainty model is to 
create a stochastic path for stock prices around the certainty path. Uncertainty 
in the model is added to let the model better satisfy properties exhibited by real 
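world stock prices. 

Let $f(S,t)$ denote the value of any derivative security (e.g. a call option) at 
time t, when the stock price is $S(t)$. Using Ito’s lemma, 

$$df = \left( \frac{\partial f}{\partial t} + \mu S \frac{\partial f}{\partial S} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 f}{\partial S^2} \right) dt + \sigma S \frac{\partial f}{\partial S}\,dW ,$$

and assuming that there are no arbitrage opportunities on the market, we get 
the following parabolic partial differential equation (PDE): 

$$\frac{\partial f}{\partial t} + rS \frac{\partial f}{\partial S} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 f}{\partial S^2} - rf = 0 .$$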



This is called the Black-Scholes partial differential equation. Its solution, 
subject to the appropriate boundary condition for f, determines the value of the 
derivative security. For a European call option, the boundary condition is 

$$f(S,T) = \max\{0, S(T) - K\} ,$$

where K is the strike price and T is the time to expiration. 

The solution of the above PDE is given by the Black-Scholes formula: 

$$C = S\,N(d_1) - K e^{-rT} N(d_2) , \tag{2}$$

where 

$$d_1 = \frac{\log(S/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}} , \qquad d_2 = d_1 - \sigma\sqrt{T} ,$$

and N is the cumulative standard normal distribution. 

Therefore, there are five parameters which are essential for the pricing of an 
option: the strike price K, the time to expiration T, the underlying stock price 
S, the volatility of the stock $\sigma$, and the prevailing interest rate r. 

In some cases, the type of option is so complicated (for example, if $\mu$ 
or $\sigma$ are random) that the solution of the PDE is very difficult to find. 
When this is the case, it is nearly always possible to obtain the option price 
by an approximation using an appropriate (possibly computationally intensive) 
numerical method. The standard methods are discussed in Pj. 
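
When the closed form applies, formula (2) can be evaluated directly, which gives a reference value against which any numerical method can be checked. The following is a minimal sketch; the function names are hypothetical, the normal distribution function is computed from the standard error function, and the parameter values are those used in the numerical tests of Section 6.

```python
import math

def norm_cdf(x):
    """Cumulative standard normal distribution N(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s, k, r, sigma, t):
    """European call price from formula (2)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

# Parameters of the numerical tests: S = 100, K = 95, r = 0.07, sigma = 0.20, T = 0.25.
price = black_scholes_call(s=100.0, k=95.0, r=0.07, sigma=0.20, t=0.25)
print(round(price, 3))
```

The result can be compared with the theoretical value 8.056 quoted in Section 6.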



3 Monte Carlo Simulation 

Monte Carlo Simulation (MCS) methods are commonly used to solve financial 
problems (e.g. to value European options and various exotic derivatives). 
Because such problems are very complicated, however, MCS algorithms 
become computationally very “expensive”. On the other hand, due to the inherent 
parallelism and loose data dependencies of the above mentioned problems, Monte 
Carlo algorithms can be very efficiently implemented on parallel machines and 
thus may enable us to solve large-scale problems which are sometimes difficult 
or prohibitively expensive to solve by other numerical methods. To price 
European options with the Monte Carlo method, the following procedure is used: 

— for the life of the option, simulate sample paths of the underlying state 
variables in the pricing model, 

— for each path, calculate the discounted cash flows of the option 

— take the sample average of the discounted cash flows over all sample paths. 
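
The three steps above can be sketched for a European call as follows. This is an illustration under stated assumptions rather than the authors' code: the path is advanced with a simple Euler step of the geometric Brownian motion, the drift is taken to be the risk-free rate r, the payoff is discounted once after averaging (equivalent to discounting each payoff), and the function name and seed handling are hypothetical.

```python
import math
import random

def mc_call_price(s0, k, r, sigma, t, dt, n_paths, seed=0):
    """Monte Carlo estimate of a European call price (Euler time stepping)."""
    rng = random.Random(seed)
    n_steps = round(t / dt)
    total_payoff = 0.0
    for _ in range(n_paths):
        s = s0
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)          # standard normal deviate
            # Euler step for dS = r S dt + sigma S dW (risk-neutral drift assumed).
            s += r * s * dt + sigma * s * math.sqrt(dt) * z
        total_payoff += max(s - k, 0.0)      # call payoff at maturity
    # Average the payoffs over all paths and discount back to time 0.
    return math.exp(-r * t) * total_payoff / n_paths
```

The cost grows as the product of the number of paths and the number of time steps, which is what makes the method expensive for small time steps and high accuracy.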

For pricing European calls, we have to compute $E(S(T) - K)^+$, where 
$(S(T) - K)^+ = \max\{S(T) - K, 0\}$. Then, we compute the call value C by 

$$C = \exp(-rT)\,E(S(T) - K)^+ . \tag{3}$$

As we have already mentioned in Section 2, the assets follow a geometric 
Brownian motion and the stock prices $S(t)$, $t \in [0,T]$, are log-normally distributed. 
For a given partition $0 = t_0 < t_1 < \cdots < t_N = T$ of the time interval $[0,T]$, a 
discrete approximation to $S(t)$ is a stochastic process $\hat{S}_t$ satisfying 

(4) 

for $k = 0, 1, \ldots, N-1$ (where we used the subscript k instead of the time step 
subscript $t_k$), with $\delta t = t_{k+1} - t_k = T/N$, and initial condition 

$$\hat{S}(0) = S(0) , \tag{5}$$

where 

— $Z_0, Z_1, \ldots, Z_{N-1}$ are independent standard normal random variables, and 

— $\sigma\sqrt{\delta t}\,Z_k$ represents a discrete approximation to an increment in the Wiener 
process of the asset. 

The call price estimate is then computed using the discount formula (3). After 
repeating the above simulation a large number of times, the initial call 
value is obtained by computing the average of estimates for each simulation. The 
disadvantage of the Monte Carlo simulation for European options is the need for 
a large number of trials in order to achieve a high level of accuracy. 

4 Cluster Architecture 

Clusters of computers (workstations) constructed from low-cost platforms with 
commodity processors are emerging as a powerful tool in computational science. 
These clusters are typically interconnected by standard local area networks, such 
as switched Fast Ethernet. Fast Ethernet is an attractive option because of its 
low cost and widespread availability. However, communication over Fast Ethernet 
incurs relatively high overhead and latency. For the problem described above, 
however, the communication requirements are insignificant. 

We used a cluster of 10 Compaq Deskpro PCs, each with an Intel 
200MHz Pentium-MMX processor and 64MB of RAM. For the Fast Ethernet 
networking we used 3Com Fast EtherLink XL 10/100Mb TX (Ethernet NIC 
3C905B-TX) PCI network cards and a 3Com Super Stack II Baseline 10/100 
Switch 3C16464A. 

The Windows 95 distribution was used. Three years ago, this cluster cost 
$10,000 but with the current depreciation of computer hardware it is now much 
cheaper. 

5 Parallel Implementation 

The Monte Carlo method for pricing derivatives is an ideal candidate for the 
use of PVM software. As stated above, a Monte Carlo simulation can take hours to 
run. Using PVM (□> El)> this disadvantage can be eliminated. 

We implemented the parallel Monte Carlo simulation under PVM on the cluster 
described in Section 4, using a master/slave approach. As already mentioned 
in Section 4, the MCS for option pricing requires only minimal 
communication, i.e. passing to each processor only the parameters: 

— S0 = initial underlying stock price, 

— K = strike price, 

— $\sigma$ = volatility of the stock, 

— r = prevailing interest rate, 

— $\delta t$ = length of each interval in the uniform N-period partition of the time T, 

— T = time to expiration, 

— NSIMP = number of Monte Carlo simulations per processor, 

to run the algorithm in parallel on each processor by computing the option’s 
value for NSIMP simulations and, at the end, to collect the results from slaves 
without any communication between sending the parameters and receiving the 
call value. The only communication is at the beginning and the end of the al- 
gorithm execution which allows us to obtain very high efficiency for our parallel 
implementation. 

This algorithm was implemented using the PVM MASTER/SLAVE model. 
The MASTER program is responsible for sending/receiving parameters to/from 
slaves and computing the final value. Each SLAVE receives from the MASTER 
program the parameters it needs for computation, computes and sends back the 
results. 

There is a worker process (task) per node (processor); the master process 
can either share a node with a worker process or run on a dedicated node. 

M. The MASTER program 

1. The MASTER process sends to all SLAVE processes the parameters S0, 
K, $\sigma$, r, $\delta t$, T, NSIMP, which are necessary for calculating C; 

2. The MASTER process receives from each SLAVE process the computed 
value of C ; 

3. The MASTER process computes the final option price. 

S. The SLAVE program 

1. Each SLAVE process receives parameters from the MASTER process for 
computing initialization; 

2. The SLAVE process performs its local computation (evaluating C using 
(3), (4) and (5)); 

3. The SLAVE process sends the results back to the MASTER process. 
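
The division of labour between the MASTER and SLAVE programs can be sketched as follows. This is not PVM code: a thread pool from the Python standard library stands in for the slave tasks, the local pricing uses an assumed Euler time step with risk-free drift, each slave is given its own seed, and the master is assumed to form the final price as the plain average of the slave estimates (appropriate when every slave runs the same NSIMP). All function names are hypothetical.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def slave(params, seed):
    """Price the call from NSIMP paths and return the local estimate of C."""
    s0, k, sigma, r, dt, t, nsimp = params
    rng = random.Random(seed)            # one independent stream per slave (assumed)
    n_steps = round(t / dt)
    total = 0.0
    for _ in range(nsimp):
        s = s0
        for _ in range(n_steps):
            s += r * s * dt + sigma * s * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        total += max(s - k, 0.0)
    return math.exp(-r * t) * total / nsimp

def master(params, p):
    """Send the parameters to p slaves, collect their estimates, and average them."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(slave, [params] * p, range(p)))
    return sum(results) / p, results
```

As in the description above, the only data exchanged are the seven parameters on the way out and one number per slave on the way back.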



Experiments with Parallel Monte Carlo Simulation 






6 Numerical Tests 

We performed a simulation study using the Black and Scholes model. The Monte 
Carlo option price for a given NSIMP and p processors can be computed according 
to the above algorithm. 

The numerical tests were made on a cluster of 10 PCs, under PVM, for r = 
0.07, S0 = 100.00, K = 95.00, $\sigma$ = 0.20 and T = 0.25. The theoretical value 
computed according to (2) is C = 8.056. We tested the methods in the European 
case because the true price can be analytically determined. 

For $\delta t \in \{10^{-3}, 10^{-2}\}$ we generated Monte Carlo option price estimates. 
These prices were computed with $10^1, \ldots, 10^6$ sample paths in order to examine 
the impact caused by different numbers of sample paths. For each test we priced 
options and computed the error = theoretical value - Monte Carlo simulation 
value. The numbers in Table 1 and Table 2 represent the errors for $\delta t$ = 0.001 
and $\delta t$ = 0.01, respectively. The first column in both tables indicates the number 
of simulation steps per processor. 



Table 1. Errors for Monte Carlo Simulation using $\delta t$ = 0.001 

NSIM \ P       1       2       3       4       5       6       7       8       9      10
10^1      -3.700  +1.312  -2.322  -0.950  -0.474  +0.909  +0.650  +0.197  -0.998  -1.811
10^2      -0.889  -0.273  -0.306  -0.327  -0.004  +0.179  -0.398  +0.018  -0.271  +0.283
10^3      +0.041  +0.113  -0.060  -0.135  +0.114  +0.052  +0.065  +0.088  -0.135  +0.078
10^4      +0.059  -0.037  +0.043  -0.049  +0.031  +0.004  +0.005  +0.000  -0.039  -0.022
10^5      +0.008  +0.026  +0.007  -0.004  -0.007  +0.009  +0.004  +0.002  -0.016  +0.003
10^6      +0.005  +0.005  -0.000  -0.004  -0.004  -0.004  -0.007  +0.006  -0.001  +0.001



Table 2. Errors for Monte Carlo Simulation using $\delta t$ = 0.01 

NSIM \ P       1       2       3       4       5       6       7       8       9      10
10^1      -2.970  -2.168  +0.006  +0.158  -1.853  -0.400  -1.038  +0.030  -1.718  -0.743
10^2      -0.830  -0.200  +0.557  +0.302  +0.273  -0.170  +0.568  -0.377  +0.252  +0.239
10^3      -0.021  -0.120  -0.195  -0.248  +0.228  -0.082  +0.006  -0.044  -0.080  -0.145
10^4      +0.044  +0.051  +0.045  -0.005  +0.061  -0.007  -0.011  -0.009  -0.055  +0.000
10^5      -0.014  -0.031  -0.014  +0.004  +0.000  -0.000  -0.012  -0.028  +0.004  -0.007
10^6      +0.003  +0.005  +0.007  +0.003  -0.002  -0.004  +0.002  -0.000  +0.004  +0.000



We can observe that the error decreases when the number of sample paths 
increases. Unfortunately, we also observe a disadvantage 
of the Monte Carlo simulation for European options: the large 
number of trials necessary to achieve a high level of accuracy. 

In all tests we used the gasdev routine for generating random deviates with a 
normal distribution (Box-Muller method) and the pseudo random generator ran2 
as the source of uniform deviates (the long period random number generator of 
L’Ecuyer with Bays-Durham shuffle and added safeguards) 0. The results are 
sensitive to the initial seed. Figure 1 shows the results of Monte Carlo simulation 
obtained with randomly chosen initial seeds. 
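
The idea of turning uniform deviates into normal deviates can be sketched with the basic trigonometric form of the Box-Muller transform. This is an illustration only: the routines named above may differ in detail, Python's built-in generator is used here as a stand-in for ran2, and the function name and seed value are hypothetical.

```python
import math
import random

def box_muller(uniform):
    """Return two independent N(0, 1) deviates from two uniform deviates."""
    u1 = 1.0 - uniform()          # shift [0, 1) to (0, 1] so that log(u1) is finite
    u2 = uniform()
    radius = math.sqrt(-2.0 * math.log(u1))
    return (radius * math.cos(2.0 * math.pi * u2),
            radius * math.sin(2.0 * math.pi * u2))

rng = random.Random(12345)        # stand-in for ran2; another seed gives other deviates
z0, z1 = box_muller(rng.random)
```

Because every deviate is a deterministic function of the seed, two runs with different seeds follow different sample paths, which is the seed sensitivity mentioned above.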








Fig. 1. Computed option price with respect to number of processors, using the MCS 
method for $10^k$, k = 1, ..., 6 sample paths (top: $\delta t$ = 0.001, bottom: $\delta t$ = 0.01) 



Theoretically, we can obtain an arbitrary degree of accuracy. From a 
practical viewpoint, however, the higher the level of accuracy, the greater the 
computational effort for the MCS algorithm. This is, of course, due to the well-known 
fact that the standard error of Monte Carlo estimate is inversely proportional 
to the square root of the number of simulated sample paths. 
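
To make this scaling concrete, the relation can be written out as a short worked example. The symbols $\hat{C}_N$ (the estimate from N sample paths), $s$ (the sample standard deviation of the discounted payoffs) and $\mathrm{SE}$ are notation introduced here for illustration.

```latex
\mathrm{SE}(\hat{C}_N) \approx \frac{s}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\mathrm{SE}(\hat{C}_{100N})}{\mathrm{SE}(\hat{C}_{N})}
  = \frac{1}{\sqrt{100}} = \frac{1}{10} .
```

Each additional decimal digit of accuracy therefore costs roughly 100 times more sample paths, which is consistent with the slow decrease of the errors along the rows of Tables 1 and 2.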

The quality of the random number generator is essential. We used a pseudo-random 
generator. Such a generator often requires a very large number 
of simulation repetitions to minimize errors. It is possible to use quasi-random 
generators, which are designed to fill the space in an interval more uniformly 
than uncorrelated random points. 

In a more formal sense, using quasi-random generators we can reduce the 
error associated with a simulation from $O(1/\sqrt{N})$ to nearly $O(1/N)$. 

The parallel efficiency E, as a measure that characterizes the quality of the 
parallel algorithm, is defined as $E = T_1 / (p\,T_p)$, where $T_p$ is the 
computational time for running the algorithm on a system of p processors. The 
parallel efficiency and the percentages of work, random generation and latency, 
in all Monte Carlo repetitions, are given in Table 3. 



Table 3. Computing time and efficiency for 10,000 simulations and $\delta t = 0.001$ 

Time in seconds   % random generator   % work   % latency   Efficiency 
13                62 %                 37 %     1 %         99 % 



We observe that a substantial share of the computing time is spent generating 
random numbers. The time required to simulate 10,000 sample paths depends on 
the latency of the network, which varies because the network is a shared 
resource. When the network is otherwise idle, however, the latency is very small. 

7 Conclusions 

This paper considered the pricing of options by parallel Monte Carlo numerical 
methods. Numerical tests were performed using PVM on a cluster of personal 
computers. 

This study describes an application of parallel computing in the finance 
industry. Options are becoming steadily more complex and exotic, and for an 
increasing number of pricing problems no analytical solution exists. This is 
where the advantage of Monte Carlo methods appears. 

Parallel models are required for performing large-scale comparisons between 
model and market prices. They are also useful tools for developing new pricing 
models and applications of pricing models. 

In our parallel implementation we calculated a single price of the call option. 
Computing this price by Monte Carlo simulation demands considerable 
computational power. Using p processors, the execution time becomes nearly 
p times smaller. 

Abstract. An exact three-dimensional quantum reactive scattering computational 
procedure (APH3D) aimed at calculating the reactive probability for 
atom-diatom chemical reactions has been parallelized. Here, we examine the 
structure of the parallel algorithms developed to achieve high performance on 
MIMD architectures. 



1 Introduction 

A key goal of modern chemical investigation is the rationalization of molecular 
processes, since this is the basis for modeling several technological and 
environmental applications. Among these, those leading to chemical reactions 
are of great importance. Although the formalism for the exact treatment of 
reactive processes has long been fully established, its complex algorithmic 
formulation and its computationally intensive numerical implementation have 
strongly limited advances in this field, even for elementary atom-diatom 
reactions. Therefore, an active line of research investigates how, for this 
class of problems, innovative numerical approaches can be designed and the 
related computational procedures implemented on parallel architectures. To 
guarantee the portability of the code, the MPI paradigm was used. 

The goal of this paper is to illustrate the advances of a project aimed at 
implementing a parallel version of a full-dimensional quantum reactive 
scattering computational procedure for atom-diatom systems. Section 2 presents 
the mathematical foundations and the algorithmic structure of the related 
codes. Section 3 illustrates the modifications needed for a parallel 
organization of the programs. Section 4 discusses the performance of their 
parallel implementations. 

2 Mathematical Foundations and Algorithmic Structure 

From a mathematical point of view, the reactive scattering problem can be 
reduced to the integration of a 9-dimensional differential equation once the 
motion of the electrons has been decoupled (Born-Oppenheimer approximation [2]). 

By taking advantage of the invariance of the center-of-mass motion and the 
constancy of both the energy $E$ and the total angular momentum $J$ ($J$ is its 
quantum eigenvalue and $M$ its projection on the z axis of the space-fixed 
coordinate frame), the equation defining the n-th partial wave $\Psi^{JMpn}$ of 
the system with parity $p$ in three dimensions can be reduced to: 

$$H \Psi^{JMpn} = E \Psi^{JMpn} \qquad (1)$$

In equation (1), $H$ is the Hamiltonian operator that, in terms of the 
Adiabatically adjusting Principal axis Hyperspherical (APH) coordinates [3], can 
be partitioned as follows: 

$$H = T_\rho + T_h + T_r + T_c + V(\rho, \theta, \chi) \qquad (2)$$

with $\rho$ being the hyperradius, $\theta$ and $\chi$ the internal hyperangles, 
and $T_\rho$, $T_h$, $T_r$, $T_c$ the terms of the kinetic operator describing 
its radial, angular, rotational and Coriolis components, respectively. The 
$V(\rho, \theta, \chi)$ term is the potential energy function describing the 
interaction between the three atoms, generated by solving the (separated) 
electronic problem at various nuclear geometries. 

Equation (1) is solved by expanding the partial wave as products of the Wigner 
rotational functions $\hat{D}^{Jp}_{\Lambda M}$ depending on the Euler angles 
$(\alpha, \beta, \gamma)$ (these angles describe the spatial orientation of the 
coordinate system integral with the plane formed by the three particles, and 
$\Lambda$ is the projection of $J$ on the z axis of these body-fixed 
coordinates), of the surface functions $\Phi$ depending on the $\theta$ and 
$\chi$ hyperangles (at a fixed value of the hyperradius), and of functions 
$\psi(\rho)$ depending on $\rho$ and carrying the scattering information: 

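$$\Psi^{JMpn} = 4 \sum_{t,\Lambda} \rho^{-5/2}\, \psi^{Jpn}_{t\Lambda}(\rho)\, \Phi^{Jp}_{t\Lambda}(\theta, \chi; \rho_\xi)\, \hat{D}^{Jp}_{\Lambda M}(\alpha, \beta, \gamma) \qquad (3)$$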



To perform the numerical integration, the hyperradius interval is partitioned 
into several small sectors. For each sector $\xi$, the surface functions 
$\Phi^{Jp}_{t\Lambda}(\theta, \chi; \rho_\xi)$ are computed at the sector 
midpoint $\rho_\xi$ by solving the eigenvalue ($\mathcal{E}^{Jp}_{t\Lambda}$) 
equation: 

$$\left[ T_h + \frac{15\hbar^2}{8\mu\rho_\xi^2} + \hbar^2 G J(J+1) + \hbar^2 F \Lambda^2 + V(\rho_\xi, \theta, \chi) - \mathcal{E}^{Jp}_{t\Lambda}(\rho_\xi) \right] \Phi^{Jp}_{t\Lambda}(\theta, \chi; \rho_\xi) = 0 \qquad (4)$$

where $\mu$ is the reduced mass of the system, and $G$ and $F$ are coefficients 
depending on the mass and the geometry of the triatom. Equation (4) is solved by 
applying the Analytic Basis Method, which expands the surface functions in 
terms of a basis set of analytic functions centered on each arrangement channel. 

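The substitution of expansion (3) into equation (1) leads to the following set 
of coupled differential equations: 

$$\left[ \frac{\partial^2}{\partial\rho^2} + \frac{2\mu E}{\hbar^2} \right] \psi^{Jpn}_{t\Lambda}(\rho) = \frac{2\mu}{\hbar^2} \sum_{t'\Lambda'} \left\langle \Phi^{Jp}_{t\Lambda} \hat{D}^{Jp}_{\Lambda M} \right| H_i \left| \Phi^{Jp}_{t'\Lambda'} \hat{D}^{Jp}_{\Lambda' M} \right\rangle \psi^{Jpn}_{t'\Lambda'}(\rho) \qquad (5)$$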

with $H_i$ being: 

$$H_i = T_h + T_r + T_c + \frac{15\hbar^2}{8\mu\rho^2} + V(\rho, \theta, \chi). \qquad (6)$$

The set of coupled equations (5) is then solved by propagating the solution 
matrix using the Logarithmic Derivative Method from a small value of $\rho$ to 
a large asymptotic one. There, the solution matrix is mapped into Delves 
coordinates and from Delves into Jacobi coordinates. Then, the scattering 
matrix S is determined by imposing asymptotic boundary conditions. 

The related computational procedure (APH3D) consists of two large programs 
(ABM and LOGDER) and a few small ones. ABM calculates the surface functions 
and the related eigenvalues. LOGDER propagates the solution matrix. The 
remaining programs perform all the transformations necessary to evaluate S. 
Recently, the problem of parallelizing ABM was tackled and partially 
satisfactory results were obtained. However, parallelizing both ABM and LOGDER 
simultaneously had not been considered before. 

3 The Parallel Implementation 

The serial version of ABM consists of two nested loops: the outer loop runs over 
the sector index; the inner loop runs over the values of the projection $\Lambda$. 

For each sector, the program determines the value of the hyperradius at the 
sector midpoint ($\rho_\xi$), integrates equation (4) and calculates the coupling 
matrix, which is then stored on disk. The scheme of the ABM program is therefore: 

    Read input data 
    Calculate quantities of common use 
    LOOP on sector index 
       Calculate ρ_ξ 
       LOOP on Λ 
          Construct the basis set at ρ_ξ 
          Solve equation (4) to generate surface functions at ρ_ξ 
          IF (not first sector) THEN 
             Calculate overlaps with surface functions at ρ_(ξ-1) 
             Store on disk overlap matrix 
          END IF 
       END loop on Λ 
       Calculate the coupling matrix 
       Store on disk the coupling matrix 
    END loop on sector index 

As is apparent from the above scheme, the computational feature inhibiting 
parallelization (provided that the calculation of the surface functions fits 
into the node memory) is the calculation of the overlap integrals between the 
surface functions of the current sector and those of the previous one. To do 
this, the eigenvectors of the surface functions of the previous sector must be 
used to evaluate those surface functions at the grid points of the present 
sector. This difficulty can be overcome by statically assigning blocks of 
sectors to each node and repeating the calculation of the surface functions of 
the previous sector only for the first sector of each block. 
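
The static scheme just described can be illustrated with a short sketch. It is not taken from the ABM code: the function names are hypothetical, sectors are numbered from 0, and blocks are assumed to be contiguous and as equal in size as possible. It lists which sectors have to be computed a second time because they precede the first sector of a block.

```python
def block_partition(n_sectors, n_nodes):
    """Split sector indices 0..n_sectors-1 into contiguous blocks, one per node."""
    base, extra = divmod(n_sectors, n_nodes)
    blocks = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes receive one additional sector.
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def redundant_sectors(blocks):
    """Sectors whose surface functions are recomputed by a second node.

    Every non-empty block that does not start at sector 0 needs the surface
    functions of the sector just before its first one, which is owned by
    another node.
    """
    return [b[0] - 1 for b in blocks if len(b) > 0 and b[0] > 0]


if __name__ == "__main__":
    blocks = block_partition(230, 4)
    print([len(b) for b in blocks])
    print(redundant_sectors(blocks))
```

With one block per node, the extra work is one repeated sector calculation for every node except the first, so it grows with the number of nodes rather than with the number of sectors.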

In a recent paper we discussed the superiority of dynamic scheduling of the 
work load. To do this, however, the calculation of the surface functions of the 
preceding sector must be repeated at each $\rho$ value. In the improved version 
of ABM, only the calculation of the basis set for the preceding sector needs to 
be repeated, thus avoiding the solution of the Schrödinger equation for this 
sector. The resulting structure of the master process is: 

    Read input data 
    Send input data to all slaves 
    LOOP on sector index 
       Calculate ρ_ξ 
       Call MPI_SEND(ρ_ξ) 
    END loop on sector index 

while the slave process is: 

    Recv input data 
    10 Call MPI_RECV(ρ_ξ) 
       Calculate 
       LOOP on Λ 
          Construct the basis set at ρ_ξ 
          Solve equation (4) to generate surface functions at ρ_ξ 
          Store on disk eigenvalues and eigenvectors 
          Call MPI_BARRIER 
          IF (not first sector) THEN 
             Construct the basis set at ρ_(ξ-1) 
             Read eigenvectors at ρ_(ξ-1) 
             Compute overlap integrals 
             Store on disk the overlap matrix 
          END IF 
       END loop on Λ 
       Calculate the coupling matrix 
       Store on disk the coupling matrix 
       GOTO 10 

In the above scheme, when the slave process is assigned a task, it computes the 
primitive basis set for the current sector, from which the eigenvalues and 
surface functions at $\rho_\xi$ are calculated by solving equation (4). The 
related eigenvectors are stored on disk. Surface functions for the previous 
sector are reconstructed by retrieving the related eigenvectors from disk 
(where they are written by the node performing the calculation for the previous 
sector), without solving equation (4) again. To prevent attempts to read 
information that has not yet been stored, nodes are synchronized before reading 
from disk to ensure that the related write operations have completed. This is 
accomplished by defining an MPI communicator that groups together all the 
slaves and by calling MPI_BARRIER on it. After reading the necessary 
information from disk, the sector calculations are completed by evaluating the 
coupling matrix. 

Another aspect considered in optimizing the parallel model is the management of 
output files. The sequential version of the program generates only two files 
(one for the whole coupling matrix and one for the whole overlap matrix) 
containing the information of all sectors. In the parallel program, the write 
operations of the various sectors are decoupled by associating an individual 
file with each sector. 

The serial version of LOGDER consists of two nested loops: the outer loop runs 
over the energy values at which the propagation must be performed; the inner 
loop runs over sectors. For each sector, the propagator integrates the set of 
coupled differential equations (5) one step forward: 

    Read input data 
    LOOP on energy E 
       LOOP on sectors 
          Calculate 
          Call the propagator 
       END loop on sectors 
       Store the solution matrix 
    END loop on energy E 

At the end of the propagation through all the sectors, the solution matrix is 
stored on disk for use by subsequent programs, and the propagation for another 
energy is started. 

The most natural way of parallelizing LOGDER is to adopt a task farm [13] at 
the level of the loop on energy. Accordingly, the scheme of the master process 
is: 



    Read input data 
    Send input data to all slaves 
    LOOP on energy E 
       Call MPI_SEND(E) 
    END loop on energy E 

The process sketched above reads and broadcasts the input data to all slave 
processes. Then the work is assigned to the workers by sending the current energy 
value. The scheme of the slave process is: 

    Recv input data 
    10 Call MPI_RECV(E) 
       LOOP on sectors 
          Calculate 
          Call the propagator 
       END loop on sectors 
       Store the solution matrix 
       GOTO 10 

The slave process first receives the input data and then the energy value E for 
which the propagation has to be performed. At the end of the propagation, the 
slave process stores the solution matrix on disk and gets ready to receive the 
next energy value. 
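
The behaviour of such a task farm can be explored without MPI by a small simulation. The sketch below is an illustration only: the function name is hypothetical and the task costs are made-up numbers. Each task (one energy value) is handed to whichever worker becomes free first, which is what happens when slaves request new work as soon as they finish.

```python
import heapq


def task_farm(durations, n_workers):
    """Simulate a task farm: each task goes to the worker that is free first.

    Returns the tasks handled by each worker and each worker's total busy time.
    """
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    assigned = [[] for _ in range(n_workers)]
    busy = [0.0] * n_workers
    for task, cost in enumerate(durations):
        t_free, w = heapq.heappop(free_at)
        assigned[w].append(task)
        busy[w] += cost
        heapq.heappush(free_at, (t_free + cost, w))
    return assigned, busy


if __name__ == "__main__":
    # One long task followed by four short ones, shared by two workers.
    assigned, busy = task_farm([4, 1, 1, 1, 1], 2)
    print(assigned)
    print(busy)
```

In this example the worker that draws the long task receives nothing else, while the other worker absorbs the four short tasks, so both finish at the same time. If, as an assumption, one of 64 nodes acts as master, 126 equal-cost energies shared by 63 slaves give exactly two energies per slave.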

4 Performance Measurements 

The performance of the parallel versions of ABM and LOGDER was measured on 
the Cray T3E/1200 256 at CINECA (Bologna, Italy), using as input parameters 
those of the Li + FH reaction. The total angular momentum was set equal to 
zero (J = 0), the hyperradius was subdivided into 230 sectors and the surface 
functions were expanded using a basis set of 277 functions. 

Elapsed times (in seconds) measured for ABM on various machine configurations 
are shown in figure 1. Figure 2 plots the related speedup, calculated using an 
estimate of the sequential time (18203 s) obtained by extrapolating the times 
measured for parallel runs to a single-processor run. 




Fig. 1. Elapsed time measured for ABM. 



As is apparent from the figures, the program scales well, even though 230 
sector calculations cannot be distributed evenly among the numbers of 
processors considered (the elapsed times measured here are about 5 times 
smaller than those reported in ref. [12]). 
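
One way to quantify the effect of the uneven distribution is sketched below. It assumes that all sector calculations cost the same and that a sector cannot be split between processors; whether the limit curve plotted for ABM is defined in exactly this way is an assumption, and the function names are hypothetical. The master process and the repeated basis-set calculations are ignored.

```python
import math


def speedup_limit(n_tasks, n_workers):
    """Upper bound on speedup for n_tasks equal, indivisible tasks."""
    # The busiest worker has to run this many tasks one after another.
    rounds = math.ceil(n_tasks / n_workers)
    return n_tasks / rounds


def speedup(t_sequential, t_parallel):
    """Measured speedup: sequential time divided by parallel elapsed time."""
    return t_sequential / t_parallel


if __name__ == "__main__":
    for p in (8, 16, 32, 64):
        print(p, speedup_limit(230, p))
```

With 230 sectors the bound equals the number of workers only when that number divides 230 (for example 23 or 46); on 64 workers the busiest one still processes four sectors, so the bound is 57.5.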

Fig. 2. Limit (dashed line) and measured speedup (solid line) for ABM. 



Fig. 3. Elapsed time measured for LOGDER. 



The performance of the parallel version of LOGDER was measured by carrying out 
calculations for 126 energies on 64 nodes (individual node elapsed times are 
plotted in figure 3). In this figure the mean node elapsed time is also given 
as a dashed line. As is apparent from figure 3, the work load is almost evenly 
distributed among the nodes, since the deviation of the individual node elapsed 
time from the mean value varies between +3.09% and -2.89%. The (small) 
imbalance is mainly due to the fact that, during the propagation, each slave 
process has to read 3 files from disk for each sector. This causes competition 
among the various processors and generates inefficiencies that depend on the 
disk I/O speed. 

5 Conclusions 

The parallelization of complex computational procedures, such as those devoted 
to full-dimensional quantum reactive scattering calculations, is a rather 
intriguing task. The study reported in this paper shows that, to achieve high 
speedups, the parallelism has to be kept at a high level (coarse granularity), 
and in some cases calculations have to be repeated to decouple existing order 
dependencies. This is the case for ABM: to remove the order dependencies of the 
sequential code, the calculation of the surface functions has to be repeated 
and several files stored on disk have to be managed (which introduces a 
dependence on the speed of the disks). However, in our tests the performance of 
the program is not significantly penalized by this, owing to the high disk 
speed of the Cray machine used. 

The parallelization of LOGDER proved more natural, since a dynamic assignment 
of the work load can be adopted when the whole propagation (which is strictly 
sequential in nature) is assigned to a single node. 

Abstract. Initial versions of MPI were designed to work efficiently on 
multiprocessors which had very little job control and thus static process 
models. Subsequently forcing them to support dynamic process operations would 
have affected their performance. As current HPC systems increase in size, with 
higher potential levels of individual node failure, the need arises for new 
fault-tolerant systems to be developed. Here we present a new implementation of 
MPI called FT-MPI that allows the semantics and associated failure modes to be 
completely controlled by the application. We give an overview of the FT-MPI 
semantics, design and some performance issues, as well as the HARNESS g_hcore 
implementation it is built upon. 



1. Introduction 

Although MPI is currently the de facto standard system used to build 
high-performance applications for both clusters and dedicated MPP systems, it is 
not without its problems. Initially MPI was designed to allow for very high 
efficiency, and thus performance, on a number of early 1990s MPPs that at the 
time had limited OS runtime support. This led to the current MPI design of a 
static process model. This model was possible for MPP vendors to implement, easy 
to program for, and, more importantly, something that could be agreed upon by a 
standards committee. 

The MPI static process model suffices for small numbers of distributed nodes 
within the currently emerging masses of clusters and for several hundred nodes 
of dedicated MPPs. Beyond these sizes the mean time between failures (MTBF) of 
CPU nodes starts becoming a factor. As attempts to build the next generation of 
petaflop systems advance, this situation will only become more adverse, as 
individual node reliability is outweighed by orders-of-magnitude increases in 
node numbers and hence node failures. 



' FT-MPI and HARNESS are supported in part by the US Department of Energy under 
contract DE-FG02-99ER25378. 

J. Dongarra et al. (Eds.): EuroPVM/MPI 2000, LNCS 1908, pp. 346-353, 2000. 

The aim of FT-MPI is to build a fault-tolerant MPI implementation that can 
survive failures, while offering the application developer a range of recovery 
options other than just returning to some previous check-pointed state. FT-MPI 
is built on the HARNESS meta-computing system [1]. 



2. Check-Point and Roll Back versus Replication Techniques 

The first method used to make MPI applications fault tolerant was check-pointing 
and roll back. Co-Check MPI [2], from the Technical University of Munich, was 
the first MPI implementation built that used the Condor library for 
check-pointing an entire MPI application. In this implementation, all processes 
would flush their message queues to avoid in-flight messages getting lost, and 
then they would all check-point synchronously. At some later stage, if either 
an error occurred or a task was forced to migrate to assist load balancing, the 
entire MPI application would be rolled back to the last complete check-point 
and restarted. The main drawback of this system is the need for the entire 
application to check-point synchronously, which, depending on the application 
and its size, could become expensive in terms of time (with potential scaling 
problems). A secondary consideration was that they had to implement a new 
version of MPI, known as tuMPI, as retro-fitting MPICH was considered too 
difficult. 

Another system that also uses check-pointing, but at a much lower level, is 
StarFish MPI [3]. Unlike Co-Check MPI, which relies on Condor, StarFish MPI uses 
its own distributed system to provide built-in check-pointing. The main 
difference from Co-Check MPI is how it handles communication and state changes, 
which are managed by StarFish using strict atomic group communication protocols 
built upon the Ensemble system [4]; it thus avoids the message flush protocol of 
Co-Check. Being a more recent project, StarFish supports faster networking 
interfaces than tuMPI. 

The project closest to FT-MPI known to the author is the unpublished Implicit 
Fault Tolerance MPI project by Paraskevas Evripidou of Cyprus University. This 
project supports several master-slave models in which all communicators are 
built from grids that contain 'spare' processes. These spare processes are used 
when there is a failure. To avoid loss of message data between the master and 
slaves, all messages are copied to an observer process, which can reproduce 
lost messages in the event of any failure. This system appears to support only 
SPMD-style computation and has a high overhead for every message. 



3. FT-MPI Semantics 

Current semantics of MPI indicate that a failure of an MPI process or 
communication causes all communicators associated with it to become invalid. As 
the standard provides no method to reinstate them (and it is unclear if we can 
even free them), we are left with the problem that this causes MPI_COMM_WORLD 
itself to become invalid, and thus the entire MPI application will grind to a 
halt. 

FT-MPI extends the MPI communicator states from {valid, invalid} to a range 
{FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED}. In essence this 
becomes {OK, PROBLEM, FAILED}, with the other states mainly of interest to the 
internal fault recovery algorithm of FT-MPI. Processes also have typical states 
of {OK, FAILED}, which FT-MPI replaces with {OK, Unavailable, Joining, Failed}. 
The Unavailable state includes unknown, unreachable or "we have not voted to 
remove it yet" states. 

A communicator changes its state when either an MPI process changes its state, 
or a communication within that communicator fails for some reason. The typical 
MPI semantics is a transition from OK to Failed, which then causes an 
application abort. By allowing the communicator to be in an intermediate state, 
we let the application decide how to alter the communicator and its state, as 
well as how communication within the intermediate state behaves. 



3.1.1 Failure Modes 

On detecting a failure within a communicator, that communicator is marked as 
having a probable error. As soon as this occurs, the underlying system sends a 
state update to all other processes involved in that communicator. If the error 
was a communication error, not all communicators are forced to be updated; if it 
was a process exit, then all communicators that include this process are 
changed. Note that this might not be all current communicators, as we support 
MPI-2 dynamic tasks and thus multiple MPI_COMM_WORLDs. 

How the system behaves depends on the communicator failure mode chosen by the 
application. The mode has two parts: one for the communication behavior and one 
for how the communicator re-forms, if at all. 



3.1.2 Communicator and Communication Handling 

Once a communicator has an error state it can only recover by rebuilding it, 
using a modified version of one of the MPI communicator build functions such as 
MPI_Comm_{create, split or dup}. Under these functions the new communicator 
follows these semantics, depending on its failure mode: 

SHRINK: The communicator is shrunk so that there are no holes in its data 
structures. The ranks of the processes are changed, forcing the application to 
call MPI_COMM_RANK again. 

BLANK: This is the same as SHRINK, except that the communicator can now contain 
gaps to be filled in later. Communicating with a gap will cause an invalid rank 
error. Note also that calling MPI_COMM_SIZE will return the size of the 
communicator, not the number of valid processes within it. 

REBUILD: The most complex mode, in that it forces the creation of new processes 
to fill any gaps. The new processes can either be placed into the empty ranks, 
or the communicator can be shrunk and the processes added at the end. This is 
used for applications that require a certain size to execute, as in 
power-of-two FFT solvers. 

ABORT: A mode which affects the application immediately when an error is 
detected and forces a graceful abort. The user cannot trap this; the only 
option is to change the communicator mode to one of the above modes. 
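
The effect of the first three modes on the rank table can be pictured with a toy model. This is not FT-MPI code and makes no MPI calls: the function name, the use of None to mark a gap and the process ids are all illustrative assumptions, and REBUILD is shown only in its fill-in-place variant.

```python
def rebuild_ranks(ranks, failed, mode, new_ids=None):
    """Toy model of how a rank table changes after failures.

    ranks   : process ids indexed by rank (None marks an existing gap)
    failed  : set of process ids that have died
    mode    : "SHRINK", "BLANK" or "REBUILD"
    new_ids : replacement process ids, required for REBUILD
    """
    if mode == "SHRINK":
        # Drop dead processes; survivors are renumbered with no holes.
        return [p for p in ranks if p is not None and p not in failed]
    if mode == "BLANK":
        # Keep the size; dead processes leave gaps at their old ranks.
        return [None if p in failed else p for p in ranks]
    if mode == "REBUILD":
        # Fill every gap in place with a newly created process.
        gaps = [i for i, p in enumerate(ranks) if p is None or p in failed]
        if new_ids is None or len(new_ids) < len(gaps):
            raise ValueError("not enough replacement processes")
        rebuilt = list(ranks)
        for i, pid in zip(gaps, new_ids):
            rebuilt[i] = pid
        return rebuilt
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    ranks = [101, 102, 103]          # hypothetical process ids for ranks 0, 1, 2
    for mode in ("SHRINK", "BLANK"):
        print(mode, rebuild_ranks(ranks, {102}, mode))
    print("REBUILD", rebuild_ranks(ranks, {102}, "REBUILD", new_ids=[104]))
```

In the BLANK case the table keeps its length of three while only two entries are valid, which mirrors the remark that the communicator size and the number of valid processes can differ.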

Communications within the communicator are controlled by a message mode for the 
communicator, which can be either of: 

NOP: No operations on error, i.e. no user-level message operations are allowed 
and all simply return an error code. This is used to allow an application to 
return from any point in the code to a state where it can take appropriate 
action as soon as possible. 

CONT: All communication that is NOT to the affected/failed node can continue as 
normal. Attempts to communicate with a failed node will return errors until the 
communicator state is reset. 
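
The difference between the two message modes can be summarised by the return code a send would receive. The following is a toy model under stated assumptions, not the FT-MPI implementation: OK and ERR_OTHER are hypothetical stand-ins for the MPI return codes, and the function name is invented for illustration.

```python
OK, ERR_OTHER = 0, 1  # hypothetical stand-ins for MPI_SUCCESS / MPI_ERR_OTHER


def send_status(dest, failed, comm_in_error, mode):
    """Return the code a send to rank `dest` would get in this toy model."""
    if not comm_in_error:
        return OK
    if mode == "NOP":
        # Every user-level operation returns an error until recovery.
        return ERR_OTHER
    if mode == "CONT":
        # Only traffic to a failed rank is refused.
        return ERR_OTHER if dest in failed else OK
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    failed = {1}
    for mode in ("NOP", "CONT"):
        print(mode, [send_status(d, failed, True, mode) for d in range(3)])
```

NOP suits codes that want to unwind to a recovery point at the first sign of trouble, while CONT lets unaffected work carry on.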

The user discovers any errors from the return code of any MPI call, with a new 
fault indicated by MPI_ERR_OTHER. Details as to the nature and specifics of the 
error are available through the cached attributes interface in MPI. 



3.1.3 Point to Point versus Collective Correctness 

Although collective operations map onto point-to-point operations in most 
cases, extra care has been taken in implementing the collective operations so 
that if an error occurs during an operation, the result of the operation will 
still be the same as if there had been no error, or else the operation is 
aborted. 

Broadcast, gather and all-gather demonstrate this well. In broadcast, even if a 
receiving node fails, the surviving receiving nodes still receive the same 
data, i.e. the end result is the same for the surviving nodes. Gather and 
all-gather are different in that the result depends on whether the problematic 
nodes sent data to the gatherer/root or not. In the case of gather, the root 
might or might not have gaps in the result. For all-gather, which typically 
uses a ring algorithm, it is possible that some nodes have complete information 
and others incomplete. Thus, for operations that require input from multiple 
nodes, as in gather/reduce type operations, any failure causes all nodes to 
return an error code rather than possibly invalid data. Currently an additional 
flag controls how strictly the above rule is enforced, by utilizing an extra 
barrier call at the end of the collective call if required. 







Graham E. Fagg and Jack J. Dongarra 



4. FT-MPI Usage Example 



Typical usage of FT-MPI would be in the form of an error check followed by some 
corrective action such as a communicator rebuild. A typical code fragment is 
shown below, where on an error the communicator is simply rebuilt and reused: 

    rc = MPI_Send(..., com); 
    if (rc == MPI_ERR_OTHER) { 
        /* rebuild the communicator and carry on with the new one */ 
        MPI_Comm_dup(com, &newcom); 
        com = newcom;   /* continue.. */ 
    } 

Some types of computation, such as SPMD master-slave codes, only need the error 
checking in the master code if the user is willing to accept the master as the 
only point of failure. The example below shows how complex a master code can 
become. In this example the communicator mode is BLANK and the communications 
mode is CONT. The master keeps track of the work allocated and, on an error, 
just reallocates the work to any 'free' surviving processes. Note that the code 
checks whether there are surviving worker processes left after each death is 
detected. 

    rc = MPI_Bcast(initial_work...); 
    if (rc == MPI_ERR_OTHER) reclaim_lost_work(...); 

    while (!all_work_done) { 
        if (work_allocated) { 
            rc = MPI_Recv(buf, ans_size, result_dt, 
                          MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status); 
            if (rc == MPI_SUCCESS) { 
                handle_work(buf); 
                free_worker(status.MPI_SOURCE); 
                all_work_done--; 
            } 
            else { 
                reclaim_lost_work(status.MPI_SOURCE); 
                if (no_surviving_workers) { /* ! do something ! */ } 
            } 
        } /* work allocated */ 

        /* Get a new worker as we must have received a result or a death */ 
        rank = get_free_worker_and_allocate_work(); 
        if (rank) { 
            rc = MPI_Send(... rank ...); 
            if (rc == MPI_ERR_OTHER) reclaim_lost_work(rank); 
            if (no_surviving_workers) { /* ! do something ! */ } 
        } /* if free worker */ 
    } /* while work to do */ 



5. FT-MPI Implementation Details 



FT-MPI is a partial MPI-2 implementation in its own right. It currently contains 
support for both C and Fortran interfaces, and all the MPI-1 function calls 
required to run both the PSTSWM [6] and BLACS applications. BLACS is supported 
so that ScaLAPACK applications can be tested. Currently only some of the dynamic 
process control functions from MPI-2 are supported. 

The current implementation is built as a number of layers, as shown in figure 1. 
Operating system support is provided by either PVM or the C HARNESS g_hcore, 
while point-to-point and collective communication is provided by the 
stand-alone SNIPE Lite communication library taken from the SNIPE project [4]. 



[Figure: layer diagram. Components shown: C-F interface handling; attribute / 
data structures and communicator state handling; derived types; buffer 
management; collective library; P2P driver; failure handler; multi-threaded 
SNIPE Lite comms library (TCP/UDP, Shmem, GM/BIP, VIA); OS support layer 
(process control / naming / failure detection); HARNESS g_hcore.] 

Fig. 1. Overall structure of the FT-MPI implementation. 



A number of components have been extensively optimised; these include: 

• Derived data types and message buffers. Particular attention has been paid to 
improving sparse data set and numeric representation handling. 

• Collective communications. They have been tuned for both optimal topologies 
(ring versus binary versus binomial trees) and dynamic re-ordering of 
topologies. 

• Point-to-point communication, using a multi-threaded SNIPE Lite library that 
allows separate threads to handle sends and receives, so that non-blocking 
communications still make progress while not within any MPI calls. 

It is important to note that the failure handler gets notification of failures 
from both the communications libraries and the OS support layer. Communication 
errors usually arise because direct communication with a failed party fails 
before the failed party's OS layer has notified the other OS layers and their 
processes. The handler is responsible for notifying all tasks of errors as they 
occur, by injecting notify messages into the send message queues ahead of 
user-level messages. 









6. OS Support and the Harness g_hcore 

When FT-MPI was first designed, the only HARNESS kernel available was an 
experimental Java implementation from Emory University [5]. Tests were conducted 
to implement the required services on this from C, in the form of C-Java 
wrappers that made RMI calls. Although they worked, they were not very 
efficient, and so FT-MPI was instead developed using the readily available PVM 
system. 

As the project has progressed, the primary author developed the g_hcore, a 
C-based HARNESS core library that uses the same policies as the Java version. 
This core allows the services that FT-MPI requires to be built. 

The g_hcore library and daemon process (g_hcore_d) have good performance 
compared to the Java core, especially in a LAN environment when using UDP, with 
remote function invocation times of 400 µs compared to several milliseconds for 
Java RMI between remote JVMs running on Linux over 100 Mb/s Ethernet. 



Current services required by FT-MPI break down into three categories: 

1. Meta-data storage. Provided by PVM in the form of message mboxes; under the 
g_hcore, as a multi-master master-slave replicated store. 

2. Process control (spawn, kill). Provided using pvm_spawn and pvm_kill for PVM, 
and fork-exec and signals under the g_hcore_d. 

3. Task exit notification. Provided by pvm_notify and pvm_probe under PVM, and 
via the spawn service under g_hcore, which catches Unix SIGCHLD and broken 
sockets. 



7. FT-MPI Tool Support 

Current MPI debuggers and visualization tools such as TotalView, Vampir, Upshot, etc.
do not have a concept of how to monitor MPI jobs that change their communicators
on the fly, nor do they know how to monitor a virtual machine. To assist users in
understanding these, the author has implemented two monitor tools: HOSTINFO, which
displays the state of the virtual machine, and COMINFO, which displays processes and
communicators in a colour-coded fashion so that users know the state of an application's
processes and communicators. Both tools are currently built using the X11 libraries
but will be rebuilt using the Java Swing system to aid portability. An example display
during a SHRINK communicator rebuild operation is shown in figure 2, where a
process (rank 1) has just exited.

[COMINFO screenshot: HARNESS FT_MPI Virtual Machine Communicator Information;
communicator: MPI_COMM_WORLD, num procs: 2, MPI size: 3; ranks 0 and 2 shown with
proc ids 0x8001 and 0x8003]

Fig. 2. COMINFO display for an application with an exited process. Note that the number of
nodes and size of communicator do not match.



8. Conclusions

FT-MPI is an attempt to provide application programmers with methods of dealing with
failures within MPI applications other than just checkpoint and restart. It is hoped that
by experimenting with FT-MPI, new application methodologies and algorithms will be
developed to allow for both high performance and the survivability required by the next
generation of teraflop and beyond machines.

FT-MPI in itself is already proving to be a useful vehicle for experimenting with
self-tuning collective communications, distributed control algorithms and improved sparse
data handling subsystems, as well as being the default MPI implementation for the
HARNESS project.

Abstract. The performance of MPI's collective communications is critical
in most MPI-based applications. A general algorithm for a given collective
communication operation may not give good performance on all systems due to
the differences in architectures, network parameters and the buffering scheme of
the underlying MPI implementation. In this paper, we discuss an approach in
which the collective communications are tuned for any given system by conducting
a series of experiments on the system. We also discuss a dynamic topology
method that uses the tuned static topology shape, but re-orders the logical
addresses to compensate for changing run-time variations. A series of experiments
was conducted comparing our tuned MPI_Bcast to various native vendor MPI
implementations. The results obtained were encouraging, and show that our
implementations of collective algorithms can significantly improve the
performance of current MPI implementations.



1. Introduction 

This project grew out of an attempt to build efficient collective communications for a
new fault tolerant MPI implementation known as HARNESS FT-MPI [9], but as it
developed it was found to be applicable to other current MPI implementations. This
project differs from at least two different efforts that have been made in the past to
improve the performance of the MPI collective communications for a given system.
They either dealt with the collective communications for a specific system, or tried to
tune the collective communications for a system based purely on mathematical models,
or both. Lars Paul Huse's paper on collective communications [2] studied and compared
the performance of different collective algorithms on SCI-based clusters. MAGPIE
by Thilo Kielmann et al. [1] optimizes collective communications for clustered
wide area systems. Though MAGPIE tries to find the optimum buffer size and optimum
tree shape for a given collective communication on a given system, these optimum
parameters are determined using a performance model called the parametrized
LogP model. Mathematical models based on a few network parameters of the system do
not adequately take into account the overlap in communication that occurs in
collective communications.

In this paper, we discuss an approach in which the optimum algorithm and optimum
buffer size for a given collective communication on a system are determined by
conducting experiments on the system. This approach follows the strategy that is used in
efforts like ATLAS [7] for matrix operations and FFTW [6] for Fast Fourier Transforms.
The experiments were conducted in several phases. In the first phase, the best
buffer size for a given algorithm for a given number of processors is determined by
evaluating the performance of the algorithm for different buffer sizes. In the second
phase, the best algorithm for a given message size is chosen by repeating phase 1 with
a known set of algorithms and choosing the algorithm that gives the best result. In the
third phase, phases 1 and 2 are repeated for different numbers of processors.

The large number of buffer sizes and the large number of processors significantly
increase the time needed to conduct the above experiments. When testing different buffer
sizes, only values that are powers of 2 and multiples of the basic data type are
evaluated. Similarly, the experiments are conducted for only "useful" numbers of processors.
Work is under way to reduce the number of experiments and still achieve good
optimization of the collective communications.
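
The three phases described above amount to a nested search, which can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `measure` callback (which in a real run would time the collective operation on the target system) and the synthetic timing function are all assumptions made for the example, and the candidate set is assumed to include segment sizes up to and including the message size.

```python
def candidate_segment_sizes(message_size, type_size):
    """Powers of two that are multiples of type_size and no larger than message_size."""
    sizes = []
    size = 1
    while size <= message_size:
        if size % type_size == 0:
            sizes.append(size)
        size *= 2
    return sizes


def tune(measure, algorithms, message_sizes, proc_counts, type_size=4):
    """Return {(nprocs, message_size): (algorithm, segment_size, time)}."""
    table = {}
    for nprocs in proc_counts:                      # phase 3: vary processor count
        for msg in message_sizes:
            best = None
            for alg in algorithms:                  # phase 2: vary algorithm
                for seg in candidate_segment_sizes(msg, type_size):  # phase 1
                    elapsed = measure(alg, nprocs, msg, seg)
                    if best is None or elapsed < best[2]:
                        best = (alg, seg, elapsed)
            table[(nprocs, msg)] = best
    return table


def synthetic_measure(alg, nprocs, msg, seg):
    """Stand-in for a real timing run; the values are made up for the demo."""
    penalty = {"chain": 0.0, "binomial": 1.0}[alg]
    return abs(seg - 64) + penalty


if __name__ == "__main__":
    result = tune(synthetic_measure, ["chain", "binomial"], [16, 1024], [2, 8])
    for key in sorted(result):
        print(key, result[key])
```

The size of the search is the product of the three loop ranges, which is why restricting segment sizes to powers of two and processor counts to a "useful" subset matters.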

In Section 2, we examine the different algorithms that are available in our reper- 
toire. In Section 3, we describe the machines we used, the experiments conducted on 
the machines, and analysis of the results. In Section 4, we discuss the dynamic topol- 
ogy method that reorders the processes within a given topology for communication 
and methods for reducing the total search space examined. In Section 5, we present 
some conclusions. Finally in Section 6, we outline the future direction of our research. 



2. Algorithms for Collective Communications

The first crucial step in our effort is to develop a range of competitive algorithms for 
efficient collective communications over different topologies and network infrastruc- 
tures. In this section, we describe the different algorithms used in our experiments. 

Developing competent algorithms for broadcast, scatter and gather is significant 
since the other collective communication operations can be implemented with the 
combination of these three collective operations. 

• Sequential tree: In this topology, the root sends the messages successively to all
the other processors. If there are n processors, this algorithm takes n-1 steps to
complete. Since the latencies are not chained, this algorithm gives good performance
in wide-area networks.

• Chain tree: In the chain tree, the root sends to process 1 and process N-1 receives
from N-2. Process $r \in [1 \ldots N-2]$ receives from $r-1$ and sends to $r+1$. Though
the last process must wait for N-1 time steps to receive the message, the pipelined
nature of the algorithm gives successive operations high throughput.

• Binary tree: Each node but the root receives from one node, and each node sends to up
to two other nodes. This algorithm takes at most $O(\log_2 N)$ steps to complete.

• Binomial tree: The definition of the binomial tree given in the paper by Lars
Paul Huse [2] is as follows. In $s \in [1 \ldots Ln]$ steps, process 0 in all groups sends
to $r_x = \lfloor (2 + \mathrm{max}_r)/2 \rfloor$, which receives from 0. All groups with
more than two processes are then split into $[0 \ldots r_x - 1]$ and
$[r_x \ldots \mathrm{max}_r]$, and new ranks $r' = r - r_x$ are assigned to the latter group.

In most cases, the binomial tree algorithm gives better performance than the binary 
tree. 
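
The four topologies can be written down as child lists per rank, which makes their shapes easy to compare. The sketch below is illustrative only: the function names are hypothetical, the root is assumed to be rank 0, the binary tree is assumed to use heap-style numbering, and the binomial tree follows the recursive split with $r_x = \lfloor (2 + \mathrm{max}_r)/2 \rfloor$ quoted above.

```python
def sequential_tree(n):
    """Root sends to every other process in turn."""
    return {r: (list(range(1, n)) if r == 0 else []) for r in range(n)}


def chain_tree(n):
    """Process r forwards to process r + 1."""
    return {r: ([r + 1] if r + 1 < n else []) for r in range(n)}


def binary_tree(n):
    """Each process sends to at most two others (heap numbering is an assumption)."""
    return {r: [c for c in (2 * r + 1, 2 * r + 2) if c < n] for r in range(n)}


def binomial_tree(n):
    """Recursive halving: the group leader sends to r_x, then both halves recurse."""
    children = {r: [] for r in range(n)}

    def split(lo, hi):
        # The group holds ranks lo..hi inclusive; lo acts as local process 0.
        if hi <= lo:
            return
        max_r = hi - lo
        r_x = (2 + max_r) // 2
        children[lo].append(lo + r_x)
        split(lo, lo + r_x - 1)
        split(lo + r_x, hi)

    split(0, n - 1)
    return children


if __name__ == "__main__":
    for build in (sequential_tree, chain_tree, binary_tree, binomial_tree):
        print(build.__name__, build(8))
```

In each case every non-root rank appears exactly once as a child, so the structure is a spanning tree of the communicator.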



3. Experimental Setup and Results

The experiments consist of several phases. In the first phase, we determine the best
segment size for a given message size for a given algorithm for a collective operation.
The segment sizes are powers of two, multiples of the basic data type and less than the
message size. Having conducted the first phase for all the algorithms, we determine
the best algorithm for a collective operation for a given message size. Message sizes
from the size of the basic data type to 1 MB were evaluated. This forms the second
phase of the experiments. Though we have conducted the experiments on only eight
processors, the third phase of the experiments would be to evaluate the results on a set
of different numbers of processors. The numbers of processors will be powers of two and
less than the available number of processors. Our current effort is directed at reducing
the search space involved in each of the above phases while still obtaining valid
conclusions. The experiments were conducted on multiple systems including:

• 143-MHz UltraSPARC systems using 100 Mbps Ethernet 

• dual processor (300/450 MHz) and single processor (450/600 MHz) Linux/NT 
machines connected by 100Mbit Ethernet, Giganet and Myrinet interconnec- 
tions 

• 34 node IBM SP2 system consisting of two eight way SMP high nodes and 32 
thin nodes running AIX 4.2. 

Figure 1 shows the results on the Intel machines with eight processors running Linux
and interconnected by 100 Mbps Ethernet. Because of the Fast Ethernet link, the
overhead associated with the communication dominates the gap times [3]. Since in the
binary and ring algorithms a processor communicates with only a few other processors,
these algorithms are able to utilize the gap values more efficiently than the other
algorithms. Hence these algorithms, combined with message segmenting, help improve
performance over the default MPICH. The MPICH default binomial algorithm
does not perform well on the Intel machines since a processor does not
send the next segment of a message to another processor as soon as the
first segment is sent.

Fig. 1. Broadcast performance for an Intel Linux Cluster.
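
Why message segmenting helps pipelined topologies can be illustrated with an idealized step count. The sketch assumes that each process performs one send per time step, that a segment received in one step can be forwarded in the next, and that the cost of a step is proportional to the segment size; latency, overhead and gap are ignored, so this is not a substitute for the measurements reported here. The names `broadcast_steps` and `relative_time` are hypothetical.

```python
def broadcast_steps(children, num_segments, root=0):
    """Count unit steps until every process holds every segment."""
    arrival = {root: [0] * num_segments}  # step at which each segment is available
    finish = 0
    stack = [root]
    while stack:
        node = stack.pop()
        busy_until = 0                    # last step in which this node sent
        for child in children[node]:
            arrival[child] = [0] * num_segments
        for seg in range(num_segments):
            for child in children[node]:
                # Send once the node is free and already holds the segment.
                step = max(busy_until, arrival[node][seg]) + 1
                arrival[child][seg] = step
                busy_until = step
                finish = max(finish, step)
        stack.extend(children[node])
    return finish


def relative_time(children, num_segments):
    """Total time in units of one unsegmented point-to-point transfer."""
    return broadcast_steps(children, num_segments) / num_segments


if __name__ == "__main__":
    n = 8
    chain = {r: ([r + 1] if r + 1 < n else []) for r in range(n)}
    for segments in (1, 4, 16):
        print(segments, relative_time(chain, segments))
```

Under these assumptions an unsegmented chain over eight processes costs seven full transfers, while splitting the message into 16 segments brings the total to 22 segment transfers, or about 1.4 full transfers.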

Experiments were also conducted on the IBM SP2 system using both its Power2
thin nodes and its SMP high nodes. The MPI collective algorithms were implemented
on top of the IBM vendor MPI. The performance of the collective algorithms on the
IBM thin nodes is shown in figure 2.

Fig. 2. Performance of the IBM SP2 thin nodes.

The superior performance of the communication adapter results in very small gap 
values. Hence the binary and the ring algorithms combined with message segmenta- 
tion give better performance than the IBM MPI algorithm for message sizes larger 
than 8K bytes. 

Figure 3 shows the results on the 8-way SMP high nodes. IBM MPI sends and receives
take place through the communication adapter. This results in large gap values for
communication between processes on an SMP node. These gap values are exploited by the
overlap in communication in binomial algorithms. This results in superior performance over
IBM MPI, which tries to use the same algorithm for communication on both thin and
high nodes. Thus different algorithms have to be used on the same system for different
memory models.



[Plot: broadcast (IBM high nodes)]

Fig. 3. Performance on the IBM high SMP nodes.



4. Dynamical Reordering of Topologies and Reduced Search Space

Most systems rely on all processes in a communicator or process group entering the
collective communication call synchronously for good performance, i.e. all processes
can start the operation without forcing others later in the topology to be delayed.
There are some obvious cases where this does not hold:

• The application is executed upon heterogeneous computing platforms where the
raw CPU power varies (or load balancing is not optimal).

• The computational cycle time of the application can be non-deterministic, as is the
case in many of the newer iterative solvers that may converge at continuously
different rates.

• Even when the application executes in a regular pattern, the physical network
characteristics can cause problems with the simple LogP model, such as when running
between dispersed clusters. This problem becomes even more acute when the system
latency is so low that any buffering while waiting for slower nodes drastically
changes performance characteristics, as is the case with BIP-MPI [8].

4.1 Dynamic Methodology

This method is a modification of the previous tuned method: we use the tuned
topology as a starting point, but the behavior of the method is varied between actual
uses of the collective operations at run-time.

The method forces all the non-root nodes to send a small start-acknowledge
(SACK) message to the root node, which the root uses to dynamically build a mapping
from communicator rank to logical address within the chosen topology. Each
process, after having sent its SACK, then receives its own topology information either
from the root directly or piggybacked on a user data message, depending on the MPI
operation being performed. This information can be split into multiple messages, such
as from whom a process receives and to whom it sends, and delivered as the
information becomes available, i.e. a process that is not a leaf node in the tree
topology might still receive all its data before knowing whom to send to.

Figure 4 demonstrates this methodology. Case 1 is where all processes within the
tree are ready to run immediately and thus performance is optimal. In Case 2, both
processes B and C are delayed and initially the root A can only send to D. As B and C
become available, they are added to the topology. At this point we have to choose
whether to add the nodes depth first as in Case 2a or breadth first as in Case 2b.
Currently breadth first has given us the best results. Also note that in Case 1, if process
B is not ready to receive, it affects not only its own sub-tree; depending on the
message/segment size, it is possible that it would block any other messages that A
might send, such as to D's sub-tree. Faster network protocols might not implement
non-blocking sends in a manner that could overcome this limitation without affecting
the synchronous static optimal case, and thus blocking sends are often used instead.
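
The breadth-first variant of the re-ordering can be sketched as a mapping from SACK arrival order to tree positions. This is a simplified illustration, assuming that the tuned topology is a binary tree with heap-style (level-order) position numbers and that all SACKs have arrived before the mapping is used; in the method described above the mapping is built incrementally. The function names are hypothetical.

```python
def assign_logical_addresses(sack_order, root=0):
    """Map communicator ranks to tree positions in SACK arrival order.

    Position 0 is the root; positions 1, 2, ... are filled level by level,
    so processes that are ready early end up closest to the root.
    """
    mapping = {root: 0}
    for position, rank in enumerate(sack_order, start=1):
        mapping[rank] = position
    return mapping


def neighbours(mapping, rank, fanout=2):
    """Return (parent, children) of rank as communicator ranks."""
    by_position = {pos: r for r, pos in mapping.items()}
    pos = mapping[rank]
    parent = by_position[(pos - 1) // fanout] if pos > 0 else None
    first_child = fanout * pos + 1
    kids = [by_position[c]
            for c in range(first_child, first_child + fanout)
            if c in by_position]
    return parent, kids


if __name__ == "__main__":
    # Ranks: A=0 (root), B=1, C=2, D=3. D is ready first, then B, then C.
    mapping = assign_logical_addresses([3, 1, 2])
    for rank in sorted(mapping):
        print(rank, neighbours(mapping, rank))
```

Each process only needs its own `(parent, children)` pair, which matches the observation that the topology information can be delivered piecewise.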

Fig. 4. Re-ordered topologies of a message tree

Currently we are testing the overhead incurred in using this technique for
different network infrastructures. We are also exploring the conditions needed for the
automatic use of this technique during the course of the computation. Initial results
have been promising, especially for large messages and for network interfaces with very
low latency that rely on the receivers having already posted receives to allow DMA
message transfers. Worst-case results have been equivalent to the overhead of n-1
small message send/receives. The best case has been within a few percent of optimal,
whereas with no re-ordering the same example has produced multiples of the optimal wall
clock time, although this varies with the operation, number of processors, data size
and level of initial synchronization.



4.2 Reducing the Search Space 

Initial efforts to reduce the search space needed to get a close to optimal result are
focused on using domain-specific knowledge, such as the shape of the time versus
segment size function shown in figure 5. In this case we have significantly reduced the
total time by using hill descent algorithms that start at the maximal segment size, rather
than a linear search across all possible segment sizes.



[Plot: scatter (cetus), 128k, 6 procs; x-axis: segment size]

Fig. 5. Shape of scatter operation for various segment sizes.
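
A hill descent over segment sizes can be sketched as follows. This is one plausible reading of the approach, not the authors' implementation: it assumes the time versus segment size curve has a single minimum, starts at the largest power of two not exceeding the message size, and halves the segment size until the measured time stops improving. The name `hill_descent_segment` and the `measure` callback are hypothetical.

```python
def hill_descent_segment(measure, message_size, type_size=4):
    """Return (best_segment_size, best_time, evaluations).

    measure(segment_size) is a caller-supplied timing function; message_size
    is assumed to be at least type_size, and type_size a power of two.
    """
    seg = 1 << (message_size.bit_length() - 1)  # largest power of two <= size
    best_seg, best_time = seg, measure(seg)
    evaluations = 1
    while seg // 2 >= type_size:
        seg //= 2
        elapsed = measure(seg)
        evaluations += 1
        if elapsed >= best_time:
            break                               # first uphill step: stop
        best_seg, best_time = seg, elapsed
    return best_seg, best_time, evaluations


if __name__ == "__main__":
    # Synthetic curve with its minimum at 4096 bytes, for illustration only.
    print(hill_descent_segment(lambda s: abs(s - 4096), 128 * 1024))
```

For a 128 KB message and a 4-byte data type, an exhaustive scan would evaluate 16 power-of-two segment sizes, while the descent on the synthetic curve above stops after 7 evaluations. A curve with several local minima would need a more careful search.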



5. Conclusion

The optimal algorithm and the optimal buffer size for a given message size depend
on the configuration of the system, including the gap values of the networks, the memory
models, the underlying communication layer, etc. The optimal parameters for a
system can best be determined by conducting experiments on the system. Our results
show that the optimal parameters obtained from the experiments gave better performance
than some native MPI implementations, which implement a single algorithm
irrespective of the system parameters. The randomness of our results for a given system
also shows that a generalized mathematical model will often not be able to give
optimal performance.

We have also shown that during application execution, dynamically altering the 
mapping between rank and position within a topology can yield additional benefits in 
terms of performance. 

6. Future Work

The research is still in a preliminary stage of development. More competitive algorithms
for other collective communications have to be implemented. One of our primary
goals in the future will be to conduct fewer experiments and still be able to obtain
optimal performance for a given message size and a given number of processors.
When complete, ACCT will be released as a standalone (MPI profiling) library that
can be used to improve any currently available MPI implementation.